In This Issue 


This issue of Survey Methodology contains the first in an annual invited paper series in honour 
of Joseph Waksberg. A brief description of the newly instituted series and a biography of Joseph 
Waksberg are given before the paper itself. I would like to thank Danny Levine for writing the 
biography of Joseph Waksberg. I would also like to thank David Binder, Paul Biemer, Graham 
Kalton, and Chris Skinner, the current members of the Committee, for choosing a very prominent 
survey researcher to author the first paper of the Waksberg Invited Paper Series. My special thanks 
are due to Graham Kalton who, as the founding Chairman of the Committee, took the lead, 
negotiated the necessary arrangements with Westat, the American Statistical Association and Survey 
Methodology to set the wheel in motion and worked hard to meet the deadline set by the journal for 
publication of the June Issue. 

The author of the Waksberg Invited Paper for 2001 is Gad Nathan. His paper, “Telesurvey 
Methodologies for Household Surveys — A Review and Some Thoughts for the Future”, presents a 
methodological history of telephone surveys from the 1930s up to the present day. Topics covered 
include sampling designs, sampling frames, coverage, nonresponse and weighting. He finishes the 
paper by describing some of the challenges and opportunities posed by more recent developments 
such as email, the internet, cell phones, and other emerging technological and social changes. 

This issue of Survey Methodology also includes a special section on composite estimation with 
four papers. The first of these papers, by A.C. Singh, Kennedy and Wu, describes the method of 
regression composite estimation developed by Singh and colleagues over the past few years. They 
compare the new approach to previous methods of composite estimation, most notably the 
K-composite and the AK-composite estimators. The paper also includes a heuristic description and 
motivation of the new approach. Advantages of the new approach are that it yields a single set of 
estimation weights, leading to internal consistency of estimates, while improving on the efficiency 
of conventional regression estimators. 

Fuller and Rao give an analytical evaluation of the properties of regression composite estimation. 
They first describe two earlier variants of regression composite estimation called modified regression 
estimators (MR1 and MR2), and analyse the efficiency and behaviour of the estimates over time 
using a simple time series model for the survey panel estimates. They conclude that a modification 
which can be viewed as a compromise between MR1 and MR2 would have the best properties 
overall. 

In his paper, Bell compares a range of alternative estimators for use in the Australian Labour Force 
Survey. Estimators considered include the AK-composite estimator, the early variant of regression 
composite estimation called MR2, Fuller and Rao’s variant of regression composite estimation, and 
a BLUE estimator chosen as an “optimal” linear combination of panel estimates. An improved 
BLUE, obtained by calibrating the BLUE estimator to some population benchmarks, is also 
proposed. These estimators are compared in terms of their differences from the conventional 
regression estimator, their standard errors, and their usefulness for seasonal adjustment and trend 
estimation. 

The final paper of the special section, by Gambino, Kennedy and M.P. Singh, describes the 
regression composite estimator that was implemented for the Canadian Labour Force Survey. This 
estimator is based on the work of A.C. Singh and colleagues and the compromise suggested by 
Fuller and Rao. The new estimators are compared to the previously used regression type estimators 
for a number of series. They find that the new estimators are usually more efficient and stable, and 
more often allow successful seasonal adjustment of the estimate series. 

Kim proposes a new method for variance estimation that accounts for random imputation based 
on a linear regression imputation model. The method is based on creating a set of pseudo-values for 
y, such that a conventional variance estimator based on these pseudo-values also accounts for the 
imputation. Calculation of the pseudo-values is described first for simple random sampling and then 
for complex designs. The approach is shown to be asymptotically equivalent to the adjusted 
jackknife of Rao and Sitter, and properties are investigated in a simulation study. 




Raghunathan, Lepkowski, Van Hoewyk and Solenberger in “A Multivariate Technique for 
Multiply Imputing Missing Values Using a Sequence of Regression Models” address the important 
issue of imputing into a complex data structure where explicit full multivariate models cannot be 
easily constructed. They adopt the approach of imputing on a variable by variable basis conditioned 
on all the observed variables. This implies that the imputations are created through a sequence of 
multiple regressions that vary depending on the type of variable being imputed. 

In their article, Dufour, Gagnon, Morin, Renaud and Särndal propose a measurement of distance 
which can be used to measure the relative impact of the nonresponse adjustment, calibration and 
the interaction between these two procedures. This measurement enables them to study and measure 
the change (from the initial to the final weight) resulting from the weight modification procedure. 
They use this measurement as a tool to compare the effectiveness of various non-response 
adjustment methods through a simulation study applied to the data from the Survey of Labour and 
Income Dynamics. The measurement is also applied to data from the National Longitudinal Survey 
of Children and Youth. 

In recent years there has been an increasing number of attempts to survey homeless people in 
major cities. The difficulty of constructing a reliable and efficient survey frame and sampling 
method, and the fluidity of the population over time make surveying of this population particularly 
difficult. The final paper of this issue, by Ardilly and Le Blanc, describes sampling and estimation 
for a current survey of homelessness in France. Problems and challenges particular to this type of 
survey are also described. The proposed survey will sample homeless individuals indirectly by 
sampling the services such as shelters and meal services which they may use. The weight-share 
method is shown to be an effective way to obtain unbiased weights for different periods of time such 
as an average day or an average week. 

Finally, I would like to take this opportunity to express my sincere thanks to Frank Mayda, 
Production Manager of Survey Methodology, who recently retired. His involvement with Survey 
Methodology since 1987 has been invaluable. I would also like to announce that Eric Rancourt has 
replaced Frank Mayda as Production Manager. 


M.P. Singh 
Waksberg Invited Paper Series 


Survey Methodology has established an annual invited paper series in honor of Joseph Waksberg, who has made many 
important contributions to survey methodology. Each year, a prominent survey researcher will be chosen to author a paper 
that will review the development and current state of a significant topic in the field of survey methodology. The author 
receives a cash award, made possible through a grant from Westat in recognition of Joe Waksberg’s contributions during 
his many years of association with Westat. The grant is administered financially and managed by the American Statistical 
Association. The author of the paper is selected by a four-person committee appointed by Survey Methodology and the 
American Statistical Association. 


JOSEPH WAKSBERG 


Joseph Waksberg (known universally as “Joe”) currently 
is Chair of the Board of Directors of Westat, a statistical 
research firm located in Rockville, MD. Throughout a 
career that now spans more than 60 years, he has made 
important contributions to sampling theory, developed inno- 
vative applications of the theory, and conducted research in 
a broad array of survey methodology issues. He is author or 
co-author of numerous papers on sampling methods, 
including random digit dialing, sampling for rare popula- 
tions, sampling for panel and rotating design surveys, and 
the role of sampling in population censuses. Additional 
contributions have ranged from methodological research on 
labor force measurement, evaluation of the quality of U.S. 
censuses, the effects of telescoping and other problems of 
recall on survey results, and the effects of cash 
incentives on response rates and survey costs, to small area 
estimation and the development of models to estimate 
election night results. His goal has been to improve both 
survey theory and practice. Last, but not least, he has been 
teacher and mentor to generations of statisticians. 

Born in Kielce, Poland in September 1915, Joe 
immigrated with his family to the United States in 1921. 
Shortly after graduating from the City University of New 
York (CUNY) in 1936 with a degree in mathematics, he 
moved to the Washington D.C. area and, after a brief stint 
with the Navy Department, joined the Census Bureau in 
1940 as a clerk. He remained at the Census Bureau for 33 
years, retiring in 1973 as Associate Director for Statistical 
Methods, Research, and Standards. In the early 1960’s, 
Waksberg, in association with Neter, initiated a classic 
study on the magnitude of various types of memory recall 
problems. This landmark effort led to procedures for 
reducing the effects of recall problems through an 
innovative sampling and data collection approach (Neter 
and Waksberg 1964; Neter and Waksberg 1965). Joe’s 
interest in this area has continued; for example, he helped 
design and analyze results from an experiment to measure 
the direction and magnitude of possible biases from a one 
year recall survey for the U.S. Fish and Wildlife Service 
(Chu, Eisenhower, Hay, Morganstein, Neter and Waksberg 
1992). The results of that experiment had a substantial 
effect on the redesign of the survey. More importantly, the 
work also added significantly to knowledge about 
respondent bias when respondents are asked to recall the 
frequency of activities under varying recall periods, and 
indicated methods of minimizing the mean square errors in 
the design of such surveys. 

The current stature of the U.S. Current Population 
Survey (CPS) as a model of statistical efficiency fully 
reflects his influence and contributions while in charge of 
sampling, statistical standards, and research for the Census 
Bureau’s household survey program. Notable among the 
changes introduced during his tenure which bear his imprint 
are the improved methods of sample selection and estima- 
tion, including the use of list sampling, replication 
variances, determination of appropriate cluster size, treat- 
ment of rare events, and composite estimation. At the same 
time, he played a major role in the experimental research 
carried out on alternative rotation and estimation patterns, 
on the use of a single household respondent, and on the 
effects of variable recall periods on labor force 
measurement. 

No discussion of Joe’s stay at the Census Bureau is 
complete without some reference to his many contributions 
to the decennial census programs. A good example is the 
evaluation program for the 1970 Census, which Waksberg 
developed, designed, and directed. Consisting of a series of 
25 separate projects, it was considered at that time as 
“radical”; today that program stands as the model for 
ongoing programs of decennial census research. When 
early field returns in the 1970 Census showed a serious 
overstatement in the reporting of “vacant” units, Waksberg 
designed, developed, and implemented, under great time 
constraints, an innovative sample survey program which 
revisited a sample of vacant units to estimate the proportion 
occupied. An adjustment procedure was then developed and 
applied, at the small area level, to the universe of vacant 
units identified in the census (Waksberg 1998). 
Subsequently, with the introduction of Revenue Sharing 
legislation in 1972, with its requirement that the Bureau 
produce annual estimates of population and per capita 
income for all 39,000 governmental units in the U.S., 
Waksberg proposed using administrative records in concert 
with survey data to provide the required local area estimates 
of population and per capita income. He initiated research 
on matching IRS records for adjacent years in order to 
obtain small-area (county) estimates of gross and net 
migration and changes in income levels, research that led to 
the development and implementation of a small area 
estimation program that is basically still in use today. 




Waksberg’s years at Westat, which began in 1973, first 
as Senior Statistician and Vice President, and recently as 
in-house consultant and Chair of the Board, have shown the 
same dedication to innovation, experimentation, and quality 
in meeting the needs of its clients and in developing 
samples and carrying out survey research. In assisting the 
National Center for Health Statistics in designing samples 
for both the National Health Interview Survey and the 
National Health and Nutrition Examination Survey, he 
made major contributions to innovative methods for effi- 
cient oversampling of minority populations, by following 
up work he had done earlier on this subject (Waksberg 
1973). His work with Judkins and Massey provides 
important information on residential concentrations by race 
and ethnic origin, essential to assessing the usefulness of 
oversampling geographical areas for minority populations, 
and persons in poverty, another subpopulation for which 
oversampling is often required (Waksberg, Judkins and 
Massey 1997). He was a co-developer of the Mitofsky- 
Waksberg method of two-stage sampling of telephone 
households (Waksberg 1978), which became the standard 
approach for RDD sampling in the United States. 
Waksberg continued to explore ways of improving RDD 
sampling by examining the bias from list-assisted samples 
(Waksberg 1983; Brick and Waksberg 1991), work which 
led to modifications and improved efficiencies of the 
method and, subsequently, to a completely different method 
of RDD sampling (Brick, Waksberg, Kulp and Starer 
1995). More recently, he participated in an examination of 
alternative ways of adjusting for households lacking 
telephones (Brick, Waksberg and Keeter 1996). His work 
in RDD sampling clearly demonstrates his life-long desire 
to constantly reexamine statistical approaches and find new 
methods to improve upon or even replace the standards, 
including those he helped establish. 

Mr. Waksberg has shared his knowledge and expertise 
in a wide range of venues outside his office. For many 
years, he taught at the Graduate School of the U.S. 
Department of Agriculture, and was a regular lecturer at the 
University of Michigan summer program in sampling 
methods. He also has been a frequent consultant on 
sampling and survey techniques to governmental statistical 
organizations throughout the world, through the sponsor- 
ship of the U.S. Agency for International Development and 
the United Nations, as well as at the request of individual 
countries, and has provided advice to the statistical offices 
of China, Argentina, Brazil, Cuba, Venezuela, Turkey, and 
South Vietnam. He has also represented the United States 
at international statistical meetings, served as technical 
expert under UN auspices, and been a member of a team 
sent to South America by the American Statistical 
Association to coordinate activities of their national 
statistical societies. 

He is a member of the American Statistical Association, 
of which he has been elected Fellow, the International 
Association of Survey Statisticians, and the International 
Statistical Institute, and has served as a member of various 
panels of the National Academy of Sciences to evaluate 
specific Federal Statistical programs. He was the first 
recipient of the Roger Herriot Award, awarded by the 
Washington Statistical Society and the ASA Sections on 
Government Statistics and on Social Statistics for 
“innovation in federal statistics”, and is a recipient of the 
Gold Medal Award of the U.S. Commerce Department. 
Finally, his greatest impact may be through the large 
number of colleagues who were inspired in their own 
efforts by his personal example, by his teaching, by his 
leadership, and by his kindness, thoughtfulness, and 
understanding. 
Author: Gad Nathan 


Gad Nathan is Professor of Statistics at the Hebrew University of Jerusalem and has long been associated with the Israel 
Central Bureau of Statistics, most recently as Chief Scientist. He received his Ph.D. from Case Institute of Technology, 
Cleveland OH and has published numerous papers in leading statistical journals, including Journal of the American 
Statistical Association, Journal of the Royal Statistical Society, Survey Methodology, Journal of Official Statistics and 
Sankhya. His main research areas are sampling methodology, inference from complex samples, computer assisted 
interviewing and telesurveys. He has held visiting and consulting positions at several academic institutions and statistical 
agencies in North America and in Europe and has served as Vice-President of the International Statistical Institute and of 
the International Association of Survey Statisticians, as well as President of the Israel Statistical Association and Chairman 
Telesurvey Methodologies for Household Surveys — 
A Review and Some Thoughts for the Future 


GAD NATHAN 


ABSTRACT 


We consider ‘telesurveys’ as surveys in which the predominant or unique mode of collection is based on some means of 
electronic telecommunications — including both the telephone and other more advanced technological devices such as 
e-mail, Internet, videophone or fax. We review, briefly, the early history of telephone surveys, and, in more detail, recent 
developments in the areas of sample design and estimation, coverage and nonresponse and evaluation of data quality. All 
these methodological developments have led the telephone survey to become the major mode of collection in the sample 
survey field in the past quarter of a century. Other modes of advanced telecommunication are fast becoming important 
supplements and even competitors to the fixed line telephone and are already being used in various ways for sample surveys. 
We examine their potential for survey work and the possible impact of current and future technological developments of 
the communications industry on survey practice and their methodological implications. 


KEY WORDS: Telephone surveys; Internet surveys; Sample design; Nonresponse; Coverage. 


1. INTRODUCTION 


Electronic telecommunications have become a predom- 
inant factor in practically all aspects of modern life at the 
beginning of the new millennium. Sample surveys are no 
exception and the widespread use of the telephone as a 
prime mode of communication for at least the past quarter 
of a century has had an important influence on survey 
practice. In fact, the telephone survey has become the major 
mode of collection in the sample survey field, especially in 
North America and Western Europe, both for surveys of 
households and individuals and for surveys of establish- 
ments. Other modes of advanced telecommunication, such 
as e-mail, Internet, videophone, fax and mobile phones are 
fast becoming important supplements and even competitors 
to the fixed line telephone. They are already being used in 
various ways for sample surveys and in this review paper 
we intend to examine their potential for survey work and 
the methodological implications of their use. We therefore 
wish to use the term ‘telesurvey’ for any survey in which 
the predominant or unique mode of collection is based on 
some means of electronic telecommunications — including 
both the telephone and other more advanced technological 
devices. Conventional surveys based on face-to-face inter- 
views in the home or (snail-)mail surveys are not included, 
unless a substantial component of the survey is based on 
some telecommunications instrument. Although this paper 
focuses on surveys of individuals and households, much of 
it is relevant to establishment surveys too. We refer to 
telesurvey ‘methodologies’ in the plural, since it seems 
obvious that no single methodology will be suitable for use 
with the plethora of possible communication devices avail- 
able in the future and their combinations. 

This paper has been prepared in recognition of Joe 
Waksberg’s unique contributions to survey methodology, 
in general, and to telephone survey methodology in 
particular. It is well recognized today that his groundbreaking 
paper, Waksberg (1978), paved the way for the widespread 
efficient use of random digit dialing for telephone surveys 
and serves as a threshold point in the development of tele- 
survey methodology. Together with many of his subsequent 
papers, his work has had a profound influence on the theory 
and practice of telephone survey methodology, some of 
which will be examined in this paper. 

We shall deal primarily with the statistical aspects of 
telesurvey methodology but recognize that these are not 
independent of non-statistical aspects, such as the cognitive 
features of telesurvey interviewing, survey administration 
and ethical considerations. In the following section we 
briefly review the early history of telephone surveys, 
through 1978. Section three reviews in some detail more 
recent developments in the areas of sample design and esti- 
mation, coverage and nonresponse and evaluation of data 
quality. Finally in section four we consider the possible 
impact of current and future technological developments of 
the communications industry on survey practice and their 
methodological implications. 


2. THE EARLY HISTORY OF TELEPHONE 
SURVEYS 


In the following we review briefly the overall early 
development of the use of telephones for survey work, as 
background for the developments in telesurvey methodolo- 
gies to be described later. More detailed and comprehensive 
coverage is provided in several books and survey papers, 
e.g., Blankenship (1977a), Groves, Biemer, Lyberg, 
Massey, Nicholls and Waksberg (1988), Frey (1989), 
Lavrakas (1993), Casady and Lepkowski (1998, 1999) and 
Dillman (1978, 2000). 

Telephones have been used for survey work since the 
thirties, though generally as a supplementary mode of 
collection. Some have erroneously blamed the disastrous 
failure of the Literary Digest survey’s prediction of a land- 
slide victory of Landon over Roosevelt in 1936, at least 
partially, on telephone undercoverage (Katz and Cantril 
1937; Payne 1956; and Perry 1968). In fact the survey was 
based on mail questionnaires and although telephone lists 
were used as a sampling frame (in combination with lists of 
automobile registrations), it seems that the failure was due 
more to nonresponse than to frame undercoverage (Bryson 
1976; Squire 1988; and Cahalan 1989). 

Most of the earliest reports on the use of the telephone in 
survey work were in the areas of public health or in market 
research applications. Many of them used some combina- 
tion of telephone interviewing with other modes of collec- 
tion and in some cases they included empirical comparisons 
of response rates or outcomes in order to assess mode 
effects. For instance, Cunningham, Westerman and Fischoff 
(1956) and Bennet (1961) report on telephone surveys for 
follow-up studies of patient treatment and Fry and McNaire 
(1958) on a national follow-up to a mail questionnaire to 
obtain opinions of hospital staff — all with high response 
rates. Mitchell and Rogers (1958) used telephone interviewing 
for a survey of telephone households on the consumption 
of dairy products and compared the results with 
those obtained from a control sample of non-telephone 
households. Cahalan (1960) compares results from tele- 
phone interviews with those from personal interviews in 
measuring newspaper readership with favourable results. 
Eastlack (1964) in a comparative telephone study of 
advertising recall and product usage shows that a rigorous 
call-back protocol provides more accurate results than a 
method without call-backs. Coombs and Freedman (1964) 
report on high telephone response (92%) in a longitudinal 
fertility survey, supplemented by personal interviews. 
Sudman (1966) describes several supplementary uses of the 
telephone for survey work, which include making of 
advance appointments and screening for rare populations, 
with positive results for cooperation rates and cost 
reductions. 

In the late sixties telephone surveys really came of age, 
as a result of several different developments. First of all the 
rapid increase in telephone coverage in Western Europe and 
North America implied that telephone interviewing could 
be used as a primary mode of collection. In the US, 
household telephone coverage reached a level of 88% in 
1970 (Massey and Botman 1988) and this level was reached 
somewhat later in most Western European countries, in 
Australia and in New Zealand (Trewin and Lee 1988). In 
parallel to the rapid increase in telephone penetration in 
many countries a serious decline in response rates and 
difficulties in contacting respondents by door-to-door 
collection were experienced in the late sixties. This led to 
serious consideration of telephone surveys both to reduce 
costs and to achieve higher cooperation rates. The use of 
telephone interviewing advanced most rapidly in 
commercial and academic survey organizations and less so 
in official government statistics. For instance the Federal 
Committee on Statistical Methodology (1984) reports that 
only about 11 percent of US Federal surveys in 1981 
involved telephone interview in any form, in most cases in 
addition to other modes. 

At first telephone interviewing was viewed with appre- 
hension, even when used only as a supplementary mode of 
collection, due to fears of high nonresponse rates and 
response biases considered inherent when interviewing was 
not carried out face-to-face. Results of some of the earlier 
telephone surveys seemed to reinforce these fears. For 
instance, a study of leaflet receipt by Larson (1952) raises 
serious doubts on the validity of telephone responses on the 
basis of a face-to-face interview follow up. Similarly Oakes 
(1954) reports on suspiciously lower response on improve- 
ments to a consumer service via the telephone than obtained 
in face-to-face interviews. Schmiedeskamp (1962) in an 
attitude survey on consumer finances finds greater avoid- 
ance of taking strong positions when telephone inter- 
viewing was used. Wiseman (1972) in a comparison of mail 
questionnaire, telephone and face-to-face personal inter- 
viewing finds mode effects for sensitive issues (abortion 
and birth control). The main differences, however, are 
between responses to mail questionnaires and to personal 
interview (telephone or face-to-face). 

Many of these fears were allayed at an early stage by the 
results of a number of more rigorous empirical studies. 
Thus Hochstim (1967) in a well-designed controlled experi- 
ment compares collection by mail, telephone and personal 
interview as the primary mode of collection. The results 
demonstrate convincingly that the three strategies of data 
collection prove to be practically interchangeable when 
compared with respect to rate of return, completeness of 
return, comparability of findings and validity of responses. 
The major difference between modes is with respect to cost, 
with a clear preference for the mail or telephone strategy. 
Similarly a small test carried out by Colombotos (1965) on 
samples of a population of physicians shows no significant 
differences between responses obtained by telephone 
and by in-person interviews. Janofsky (1971) reports simi- 
larity in willingness to express feelings on health issues 
between telephone respondents and face-to-face interview 
respondents. A well designed validation study by Locander, 
Sudman and Bradburn (1976) of the effects of question 
threat and mode of collection found no meaningful differ- 
ences in response bias between telephone and face-to-face 
interviews. Finally, in a small carefully controlled field 
experiment, Rogers (1976) tested the effects of alternative 
interviewing strategies on the quality of responses and on 
field performance in a survey on a variety of complex 
attitudinal, knowledge and personal items. The results again 
indicate that the quality of data obtained by telephone is 
comparable to that obtained by interviews in person. A 
major national study comparing telephone and face-to-face 
interviewing was conducted by Groves and Kahn (1979). It 
was based on an intensive analysis of the large omnibus 
surveys carried out under the two modes by the University 
of Michigan Survey Research Center. It provided important 
information on data quality which did not indicate any 
substantial mode effects. These and other early studies, 
which foreshadowed several systematic studies of mode 
effects carried out in the eighties and nineties (to be 
discussed later) contributed to the legitimacy of telephone 
surveys as a standard mode of collection. 

The initial use of telephones for sample surveys was usu- 
ally based on samples selected from general frameworks, 
such as telephone directories, or from specific frameworks 
for small sub-populations. Towards the end of the sixties 
there was increased awareness of high rates of unlisted 
telephone numbers and of substantial differences between 
households with listed and non-listed numbers (see details 
in section 3.1.1). An important development that overcame 
this problem was the sampling method of Random Digit 
Dialing (RDD), first introduced by Cooper (1964) and 
further improved and developed by Eastlack and Assael 
(1966) and by Glasser and Metzger (1972). An inherent 
inefficiency of these basic element RDD methods was the 
large number of calls to numbers that did not yield an 
interview (non-working and non-residential numbers). A 
two-stage RDD sampling method was first proposed to deal 
with this problem by Mitofsky (1970) and subsequently 
elaborated and put on a firm theoretical basis by Waksberg 
(1978). The introduction of what was to become known as 
the Mitofsky-Waksberg scheme contributed greatly to the 
widespread use of telephone surveys in the eighties and 
nineties. 
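
The efficiency argument above can be illustrated with a small simulation. This is a minimal sketch, not a reproduction of the procedure in Waksberg (1978): it assumes the commonly described form of the two-stage scheme, in which numbers are grouped into banks of 100, one number is dialed in a randomly chosen bank, and the bank is retained for further dialing only if that first number is residential. The toy frame, the function names and all parameter values are hypothetical, and numbers are drawn with replacement at the first stage for simplicity.

```python
import random


def build_toy_frame(n_banks=200, share_populated=0.3, density=0.6, seed=0):
    """Hypothetical frame: each bank holds 100 numbers (True = residential).
    A bank is either empty or has roughly `density` residential numbers."""
    rng = random.Random(seed)
    frame = []
    for _ in range(n_banks):
        p = density if rng.random() < share_populated else 0.0
        frame.append([rng.random() < p for _ in range(100)])
    return frame


def element_rdd(frame, n_interviews, rng):
    """Element sampling: dial random numbers over the whole frame and
    return the number of calls needed to reach n_interviews households."""
    if not any(any(bank) for bank in frame):
        raise ValueError("frame contains no residential numbers")
    calls = hits = 0
    while hits < n_interviews:
        calls += 1
        if rng.choice(rng.choice(frame)):
            hits += 1
    return calls


def two_stage_rdd(frame, n_banks_retained, k_additional, rng):
    """Two-stage sampling (assumed form): one screening call per candidate
    bank; retained banks are dialed further until k_additional more
    residential numbers are found. Returns the total number of calls."""
    if not any(any(bank) for bank in frame):
        raise ValueError("frame contains no residential numbers")
    calls = retained = 0
    while retained < n_banks_retained:
        bank = rng.choice(frame)
        first = rng.randrange(len(bank))
        calls += 1
        if not bank[first]:
            continue  # first number unproductive: reject the whole bank
        retained += 1
        others = [i for i in range(len(bank)) if i != first]
        rng.shuffle(others)
        found = 0
        for i in others:
            if found == k_additional:
                break
            calls += 1
            if bank[i]:
                found += 1
    return calls


if __name__ == "__main__":
    frame = build_toy_frame()
    # Both designs target 100 households: 20 banks x (1 + 4) in two stages.
    print("element RDD calls:  ", element_rdd(frame, 100, random.Random(1)))
    print("two-stage RDD calls:", two_stage_rdd(frame, 20, 4, random.Random(1)))
```

In a frame like this one, where residential numbers cluster in a minority of banks, the second-stage calls are made only in banks already known to contain at least one household, so far fewer calls land on non-working or non-residential numbers.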

Finally the technological advances in telecommunica- 
tions and automation in the sixties and seventies contributed 
to the advantages of telephone surveying. Universal direct 
long distance dialing enhanced the possibilities of carrying 
out national surveys from a single center or from a small 
number of interviewing centers with all the advantages of 
central control and administration. However the greatest 
impact on the expansion of telephone surveys has undoubt- 
edly been the introduction of Computer Assisted Telephone 
Interviewing (CATI) in the seventies. This is due both to 
the simplicity of CATI for conducting telephone interviews 
and to the possibilities it offers for the use of automation in 
many important non-interviewing tasks (e.g., dialing, recall 
schedules, etc.). 

One of the first uses of the computer for on-line ques- 
tioning was in the form of a multi-station computer-based 
laboratory experiment designed to elicit subjective infor- 
mation — Shure and Meeker (1970). A good account of the 
early history of CATI can be found in the special issue of 
Sociological Methods and Research (Freeman and Shanks 
1983), following the Berkeley Conference on Computer- 
Assisted Survey Technology held in Spring 1981. Market 
research organizations were the first to introduce CATI 
systems for their current operations. Chilton Research 
Services developed the Survey Response Processor and was 
using it on a regular basis as early as 1972 (Fink 1983). 
Other commercial survey organizations, applying different 
systems, realized early on the advantages of CATI — for 
instance the A&S/CATI™ system (Dutka and Frankel 
1980). Academic survey research organizations were quick 
to follow with the earliest systems developed at UCLA and 
Berkeley for the large scale CATI-based California 
Disability Survey — Shanks, Nicholls and Freeman (1981) 
and Shanks (1983). Another early development of a CATI 
system at an academic survey organization, using a 
different approach, based on microcomputers, was that of 
the University of Wisconsin (Palit 1980; Palit and Sharp 
1983). In Europe the first survey research organizations to 
use CATI were Social and Community Planning Research 
(SCPR — now the National Centre for Social Research) in 
the UK (Sykes and Collins 1987) and the State University 
of Utrecht, Netherlands (Dekker and Dorn 1984). The 
introduction of CATI systems into official statistics was 
slower. In the US it started in 1982 at the Census Bureau 
(Nicholls 1983) and at the National Agricultural Statistics 
Service (Tortora 1985) and at the same time at Statistics 
Netherlands (1987). By 1987 practically all organizations 
surveyed in a (non-probability) sample of 27 survey organi- 
zations (eighteen in the US and nine elsewhere) were using 
CATI for some or all of their telephone surveys — Berry and 
O’Rourke (1988). A report of the Federal Committee on 
Statistical Methodology (1990) indicated that the number of 
CATI installations worldwide at the end of the eighties was 
estimated to exceed 1,000 and that in 1988, the U.S. 
Government had 51 cooperating CATI centers. It should be 
noted that the development of CATI quickly became part of 
a wider movement toward computer assisted interviewing 
(CAI) or computer assisted information collection 
(CASIC), which includes also CAPI (Computer Assisted 
Personal Interviewing) and CASI (Computer Assisted Self 
Interviewing) — Nicholls (1988). A more complete history 
of the development of CATI and of CASIC, in general, can 
be found in Couper and Nicholls (1998). 


3. RECENT DEVELOPMENTS IN 
TELEPHONE SURVEYS 


In the last quarter of a century telephone surveying has 
definitely come of age. Lyberg and Kasprzyk (1991) claim 
that it has become the dominant mode of collection in 
countries with extensive telephone coverage. 

Hundreds of scientific papers have been published 
during this period on a wide range of different aspects of 
telephone surveys. Several general books on the subject 
have appeared — Blankenship (1977a), Groves and Kahn 
(1979), Frey (1989) and Lavrakas (1993). A number of 
conferences have been devoted to telephone survey 
methodology or have dealt with specific aspects of the 
topic. The results have appeared in monographs or special 
issues of scientific journals. A major conference on tele- 
phone survey methodology was held in November 1987 in 
Charlotte, NC, with the resulting volume edited by Groves, 
Biemer, Lyberg, Massey, Nicholls and Waksberg (1988) 
and the special issue of the Journal of Official Statistics, 
edited by Groves and Lyberg (1988b). The Berkeley 
Conference on Computer-Assisted Survey Technology held 
in Spring 1981 (Freeman and Shanks 1983) dealt primarily 
with telephone surveys and CATI was a major topic at the 
InterCASIC ’96 International Conference on Computer 
Assisted Survey Information Collection, held in San 
Antonio, TX in December 1996 (Couper, Bethlehem, 
Baker, Clark, Martin, Nicholls and O’Reilly 1998) and at 
the ASC 3rd International Conference at Edinburgh in 
September 1999 (Banks, Christie, Currall, Francis, Harris, 
Lee, Martin, Payne and Westlake 1999). 

Extensive bibliographies with several hundred entries 
can be found in the above sources, as well as in Khurshid 
and Sahai (1995), which covers the period through 1991, 
and in Survey Research Center (2000), which updates 
previous bibliographies with respect to sample design for 
household telephone surveys through 2000. 

In the following we review the development of telephone 
survey methodology for household surveys during the past 
25 years in the areas of sample design and estimation, 
coverage and nonresponse and evaluation of data quality. 


3.1 Sample Design and Estimation 


Sampling methodology for telephone surveys is based on 
the general principles of sampling. It is primarily adapted to 
the special situation of telephone surveys with respect to the 
sampling framework used. Thus we adopt the classification 
proposed by Lepkowski (1988) for telephone sampling 
methods, according to the underlying sampling framework: 
directory and commercial lists, telephone numbers (RDD) 
and combined methods (list-assisted and dual frame). 


3.1.1 List-based Sampling Procedures 


As mentioned above, the earliest telephone surveys were 
all based on samples selected from lists. In many cases they 
were mixed-mode surveys where telephone interviewing 
was used to compensate for non-response in face-to-face 
interviews or for follow-up. Thus so-called ‘warm telephone 
interviewing’ schemes have been used in the US Current 
Population Survey and in the Canadian Labour Force 
Survey — Drew, Choudhry and Hunter (1988). In these 
cases sampling is based on a general list framework to 
which information on telephone numbers is added and no 
special features of the use of the telephone are involved in 
the sample design. The same goes for ‘pure’ telephone 
surveys of special populations, such as physicians, for 
which a complete list of the population is available with 
telephone numbers and can be used as a sample framework 
(see, for example, Gunn and Rhodes 1981). Another 
example is where telephone interviewing is used in 
follow-up waves of a panel survey with the first contact 
carried out by a face-to-face interview. For instance in the 
Israel Labor Force Survey the first contact is by a home visit 
and the second and third waves are carried out by telephone 
for households who are willing to respond by telephone — 
Nathan and Eliav (1988). A related approach, used recently 
in a pilot study for the US National Study of Health and 
Activity (Maffeo, Frey and Kalton 2000), is to take an area 
sample, find telephone numbers where possible, for 
telephone interviewing, and use face-to-face interviewing 
for other households and for telephone nonrespondents. 

The most easily obtained and low-cost directory that can 
be used as a framework for a telephone survey is, of 
course, the telephone directory itself, or some modification 
of it. Originally the paper version of the directory was used, 
while nowadays an electronic version would usually be 
available. The major deficiencies of the telephone directory 
as a sampling framework are well documented. They are 
undercoverage, overcoverage, duplication and lack of 
auxiliary information. Undercoverage is by far the most 
serious deficiency and includes both non-telephone house- 
holds and households with telephones unlisted by choice or 
those not yet included in the directory. The biases due to 
non-telephone households arise, of course, irrespective of the 
framework used and will be dealt with in section 3.3. 

The extent of unlisted telephones varies considerably by 
country and type of location, as well as by other household 
variables. Sykes and Collins (1987) report on an unlisted 
rate of 4% in the Netherlands and 12% in the UK. Fréjean, 
Panzani and Tassi (1990) estimate the unlisted rate in 
France as 14% and national US estimates in the seventies 
were of over 17-19% (Blankenship 1977b and Glasser and 
Metzger 1975). Rich (1977) reports on increasing rates of 
nonpublished telephones (excluding those involuntarily 
unlisted) in the Pacific Telephone’s California serving area 
from 9% in 1964 to 28% in 1977. In addition some 5% of 
home telephones in California were estimated to be invol- 
untarily unlisted (assigned after publication of the direc- 
tory). More recent studies show substantially higher 
unlisted rates. Thus Genesys (1996) reports unlisted rates of 
40% in 1993 and of 37% in 1995, based on national 
samples of more than 100,000 RDD telephone interviews 
and Survey Sampling Inc. (1998) estimates the US national 
unlisted rate for 1997 at 30%. Results of a small-scale study 
of the Jerusalem area (Nathan and Aframian 1996) indicate 
an unlisted rate of 27%. 

Many studies have shown substantial differences be- 
tween listed and unlisted telephone household character- 
istics, indicating disturbing potential coverage biases for 
directory-based samples. In the US these differences were 
demonstrated, for instance, in a study by Brunner and 
Brunner (1971), who found highly significant differences 
between listed and unlisted telephone households with 
respect to a wide range of demographic and socio-economic 
variables. Leuthold and Scheele (1971) found higher rates 
of nonlisting among blacks, city dwellers, young people, 
apartment dwellers, divorced and separated and among 
service workers. Similarly, Roslow and Roslow (1972) 
found significant differences in audience shares between 
listed and unlisted telephone households. Glasser and 
Metzger (1975) showed that nonlisted rates were higher in 
the West, in major metropolitan areas, among non-whites 
and the young. Blankenship (1977b) and Rich (1977) found 
highly significant differences between listed and unlisted 
households with respect to sex and age of household head, 
occupation, household size and income. In the UK Sykes 
and Collins (1987) found more unlisted numbers among the 
young, the poorest and those living in London. The results 
of Nathan and Aframian (1996) for the Jerusalem area 
showed lower rates of TV ownership and of TV viewing (of 
those with TV) in an RDD sample as compared with a 
directory listing sample. 

Besides the undercoverage resulting from unlisted 
numbers, as indicated above, directory listings also suffer 
from problems of overcoverage, duplication and lack of 
updated auxiliary information. Overcoverage occurs when 
a unit outside the population is included in the framework. 
This may be due to the fact that disconnected numbers often 
remain in the directory, commercial numbers are not always 
clearly designated as such or other cases of unrecognized 
ineligibility. Duplication occurs when the same unit is 
represented in the frame more than once and the duplication 
is not recognized. Duplication can usually be discovered 
during sampling if the entries for the same household are 
listed consecutively but not if they appear separately (e.g., 
under different surnames). If duplication is ascertained 
during the interview (i.e., by obtaining information on the 
number of connected lines available to the household or the 
number of directory listings) it can be dealt with by appro- 
priate weighting. Although these problems are sur- 
mountable, at a cost, that of undercoverage is not and this 
indicates the need for more representative sample frame- 
works than provided by directories. A popular alternative to 
the traditional telephone directory (in general prepared by 
the company providing telephone service to the area) has 
been the lists prepared by commercial firms, usually for 
purposes of marketing. These may be city directories, 
obtained from municipal address listings with telephone 
numbers obtained from directories or other sources, 
subscriber lists of telephone companies or national master 
address lists, such as that provided by Donnelley Marketing, 
Inc. in the US — Lepkowski (1988). These lists provide 
important auxiliary data, such as geographic information, 
from the Census of Population and Housing and from other 
sources. They do not, in general, overcome the bias due to 
unlisted numbers and their cost may be high. They can 
result in some gain in sampling variance, due to the possi- 
bility of basing an efficient design on the auxiliary informa- 
tion. Potentially, lists used by emergency services to 
determine the physical location of callers could be used as 
frameworks, although access to these lists would be 
difficult for non-government survey organizations. 
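
The weighting adjustment for duplicate listings mentioned earlier in this section can be illustrated with a minimal sketch. It assumes that each responding household reports its number of directory listings (or connected lines) and that its base weight is divided by that count; the function names and the example data are hypothetical and are not taken from any of the cited studies.

```python
def multiplicity_weights(n_listings, base_weight=1.0):
    """Down-weight each household by its reported number of listings,
    so that a household listed twice counts half as much per selection."""
    return [base_weight / n for n in n_listings]


def weighted_mean(values, weights):
    """Weighted mean of a survey variable."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical responses: 1 = owns a TV, 0 = does not.
owns_tv = [1, 0, 1]
listings = [1, 2, 1]          # second household appears twice in the frame

weights = multiplicity_weights(listings)
estimate = weighted_mean(owns_tv, weights)
```

Here the doubly listed household receives a weight of 0.5, which compensates for its doubled chance of selection under this simple assumption.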


3.1.2 Random Digit Dialing — The Mitofsky-Waksberg 
Scheme 


In order to overcome many of the inherent problems of 
directories and commercial lists, Random Digit Dialing 
(RDD) methods have become a popular choice for tele- 
phone surveys, primarily in the US. These are based on the 
frame of all possible telephone numbers. The method was 
originally proposed by Cooper (1964), who added random 
four digit suffixes to known prefixes in a local survey. This 
basic element sampling method was further improved and 
developed by Eastlack and Assael (1966) and by Glasser 
and Metzger (1972), on a national level, by identifying 
‘working banks’ of numbers from telephone company 
information. 
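
As a rough illustration of basic element RDD as described above, the following sketch appends a random four-digit suffix to a prefix drawn from a list of known prefixes. The function name, the example prefixes and the seeded random generator are assumptions made for the illustration and are not details given by Cooper (1964).

```python
import random


def element_rdd_sample(prefixes, n, seed=None):
    """Draw n telephone numbers by adding a random four-digit suffix
    to a prefix chosen at random from the list of known prefixes."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        prefix = rng.choice(prefixes)      # known prefix
        suffix = rng.randrange(10_000)     # 0000 to 9999, equally likely
        sample.append(f"{prefix}{suffix:04d}")
    return sample


# Hypothetical six-digit prefixes (area code + first three digits).
known_prefixes = ["212555", "415555"]
numbers = element_rdd_sample(known_prefixes, n=5, seed=1)
```

Every number within a known prefix has the same chance of being generated, whether or not it is listed, which is what removes the dependence on the directory.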

The use of RDD has until recently been confined, by and 
large, to the US and Canada. Thus Sykes and Collins 
(1987) report that telephone surveys were still rare in the 
UK at the end of the eighties, primarily due to low tele- 
phone coverage. In particular RDD surveys were rarely 
used — one of the reasons being the lack of uniformity in the 
length of telephone numbers at the time. However recently, 
with the increase of telephone coverage in the UK to some 
96% at the end of the nineties and the standardization of 
telephone numbers to ten digits, RDD surveys have become 
more popular — see e.g., Collins (1999) and Nicolaas, Lynn 
and Lound (2000). Similarly, Gabler and Haeder (2000) 
report that an RDD method, modified in order to deal with 
varying telephone number lengths (from 6 to 11 digits!), is 
now standard procedure for telephone surveys in Germany. 

Mitofsky (1970) first proposed a two-stage RDD 
sampling method to deal with the problem of the inherent 
inefficiency of these basic element RDD methods due to the 
large number of telephone numbers that had to be called without yielding an 
interview (non-working and non-residential numbers). This 
was subsequently elaborated and put on a firm theoretical 
basis by Waksberg (1978) and the method became known 
as the Mitofsky-Waksberg scheme. This scheme or varia- 
tions of it have become the predominant sampling method 
for telephone surveys, at least in the US. 

The method is based on the fact that household tele- 
phone numbers are, in general, clustered in series of conse- 
cutive numbers or within banks of numbers with the same 
first r digits. For the US r is usually set at eight (for ten digit 
telephone numbers, including area code), so that the banks 
or clusters (PSU’s) are of size N = 100 each. It is assumed 
that the telephone company can provide a list of all opera- 
ting prefixes (area code + first three digits), i.e., those to 
which residential numbers have been assigned. To the six 
digit numbers in this list all possible choices of two digits 
are added, resulting in a sampling frame of eight digit 
numbers that represent the M PSU’s in the population. 
Sample PSU’s are selected from this frame at random (with 
replacement) consecutively and for each PSU selected two 
final digits are selected at random. The resulting ten digit 
number is dialed and if the number is not that of a residence 
(according to the survey definition), the PSU is dropped 
from the sample. If it is a residence a simple random sample 
(without replacement) of k additional residential numbers 
is selected by contacting numbers selected at random 
(without replacement) from the PSU, until k additional 
residential numbers are obtained. The procedure of PSU 
selection continues until a fixed number of PSU’s, m, has 
been selected. It is easily seen that, assuming that the 
number of residential numbers in each selected PSU, $P_i$, is 
at least k, the total sample size of residential telephone 
households is m(k + 1) and that the final sample is an 
equal probability sample from the population of all 
residential telephone households. 
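
The two-stage procedure described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population of banks is artificial (65% empty banks and exactly 60 residential numbers in each non-empty bank), residential status is taken as known without error once a number is dialed, and the function and variable names are hypothetical.

```python
import random


def mitofsky_waksberg(banks, m, k, seed=None):
    """Simulate the two-stage scheme.

    banks: list of M lists of 100 booleans (True = residential number).
    Returns (sample, calls): sample is a list of (bank, last_two_digits)
    pairs for the residential numbers reached, calls is the number dialed.
    """
    rng = random.Random(seed)
    sample, calls, retained = [], 0, 0
    while retained < m:
        b = rng.randrange(len(banks))    # first stage: bank, with replacement
        first = rng.randrange(100)       # random final two digits
        calls += 1
        if not banks[b][first]:
            continue                     # not a residence: PSU is dropped
        retained += 1
        sample.append((b, first))
        # Second stage: dial the other numbers of the bank in random order
        # (without replacement) until k further residences are found.
        others = [d for d in range(100) if d != first]
        rng.shuffle(others)
        found = 0
        for d in others:
            if found == k:
                break
            calls += 1
            if banks[b][d]:
                sample.append((b, d))
                found += 1
    return sample, calls


# Artificial population: 13 of every 20 banks are empty (t = 0.65).
pop_rng = random.Random(0)
banks = []
for i in range(200):
    if i % 20 < 13:
        banks.append([False] * 100)
    else:
        row = [True] * 60 + [False] * 40
        pop_rng.shuffle(row)
        banks.append(row)

sample, calls = mitofsky_waksberg(banks, m=20, k=5, seed=1)
```

Because a bank is retained with probability proportional to its number of residences and a fixed number k of further residences is then taken, each residential number has the same overall chance of selection, provided every retained bank contains enough residences.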

Waksberg (1978) shows that if we designate by 
$\pi = \left(\sum_i P_i\right)/(NM)$ the proportion of residential numbers 
in the population and by $t$ the proportion of PSU’s with no 
residential numbers (i.e., for which $P_i = 0$), then the 
expected number of total calls is given by 
$m[1 + (1 - t)k]/\pi$, assuming that $P_i \geq k + 1$ for all PSU’s 
with at least one residential number. The last assumption 
can be dropped if PSU’s are grouped so that the restriction 
holds in each group or if unequal weighting is used. 
Optimal values of the design parameters are obtained under 
a simple cost function and the method is extended to deal 
with repeated surveys. The main advantage of the method 
is the reduction in the expected number of calls which have 
to be made in order to attain a given effective sample size, 
especially if t, the proportion of PSU’s with no residential 
numbers, is larger than 0.5. Groves (1977) provides data for 
a national study indicating a value of t of about 0.65. This 
advantage has to be weighed against the increase in 
variance due to the effect of clustering. However, taking 
costs into account, illustrative calculations for typical values 
of the parameters show that reductions in costs run between 
20 and 40%. 
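
To give a feel for the size of the reduction in calls, the following sketch evaluates the expected number of calls, m[1 + (1 - t)k] divided by the proportion of residential numbers, and compares it with element RDD. The comparison formula for element RDD, m(k + 1) divided by the same proportion, is an assumption of this illustration (each call is treated as reaching a residence independently with that probability), and the parameter values other than t = 0.65 are hypothetical.

```python
def expected_calls_mw(m, k, pi, t):
    """Expected total calls under the two-stage scheme:
    m[1 + (1 - t)k] / pi."""
    return m * (1 + (1 - t) * k) / pi


def expected_calls_element_rdd(m, k, pi):
    """Expected calls needed by element RDD to reach the same
    m(k + 1) residences when each call succeeds with probability pi."""
    return m * (k + 1) / pi


# t = 0.65 as reported by Groves (1977); m, k and pi are illustrative.
mw = expected_calls_mw(m=100, k=5, pi=0.2, t=0.65)
rdd = expected_calls_element_rdd(m=100, k=5, pi=0.2)
```

With these values the two-stage scheme requires about 1,375 calls against 3,000 for element RDD. The saving in calls alone overstates the gain, since the clustered sample has a larger variance for the same number of interviews.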

The major operational drawback of the method is in its 
sequential nature. This makes it unwieldy to carry out man- 
ually. However the sequential operation poses no problem 
when the process of selection is fully automated. The 
method as described above has some additional problems, 
most of which can be overcome by simple modifications. 
Assuming that prior information on the number of tele- 
phone households is not available, selection probabilities 
are not known, although the value of $\pi$ can be estimated 
from the sample. The practical necessity to introduce a 
stopping rule for the number of calls to numbers which do 
not answer or to refusals to answer, even whether the 
number is a residential one, implies that the method cannot 
be strictly applied as designed, resulting in possible bias. 
The problem of households with multiple telephone num- 
bers can be overcome if correct information on the number 
of different lines is obtained but the required re-weighting 
impinges on the simplicity of equal weighting. In some 
cases names and addresses can be obtained for RDD 
numbers by matching with address lists so that advance 
notice can be sent to at least part of the potential respon- 
dents. However this is a complex procedure and the diffi- 
culties in sending advance notice to respondents (common 
to all RDD procedures) have made the procedure difficult to 
consider for some official statistical agencies. 


3.1.3 Modifications of the Mitofsky-Waksberg and 
Other RDD Methods 


Some of the drawbacks of the basic method are 
overcome by the generalization due to Potthoff (1987a, 
1987b). The method is based on the definition of a set of 
auspicious telephone numbers. This could consist of only 
residential numbers, as in the Mitofsky-Waksberg method, 
or a wider set which includes all residential numbers — for 
instance the set of all numbers which ring (including 
engaged, recorded messages and operators). The first stage 
of selection is by simple random sampling of a fixed 
number, m, of PSU’s. From each selected PSU a fixed 
number of calls, c, are made and for each of them it is 
determined whether the number is auspicious or not. A PSU 
is discarded if all c numbers selected are inauspicious. 
Retained PSU’s are defined as Type I if only one number is 
auspicious and as Type II if two or more are auspicious. The 
second stage consists of selecting and dialing kc numbers 
from Type I PSU’s and k(c - 1) numbers from Type II 
PSU’s, where k is an integer. At all dialed numbers the unit 
is determined as residential or out-of-scope and an 
interview is attempted for all residential units. A supple- 
mentary sequential segment for Type I PSU’s selects 
additional telephone numbers that are dialed until a total of 
k auspicious numbers are obtained. An interview is 
attempted at each auspicious number dialed in the sequen- 
tial segment. Potthoff (1987a) shows that, under certain 
conditions, all residential telephone numbers have the same 
probability of selection and develops unbiased and ratio 
estimates and their variances. Cost comparisons and some 
modifications to overcome practical problems are also 
given. The method reduces the problem of ambiguity on the 
status of dialed numbers from which no response is 
obtained and also the problem of exhaustion of the 
residential numbers in a PSU. 
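
The first-stage classification of PSU’s in this generalization, as summarized above, can be sketched as follows. The sketch covers only the classification and the fixed second-stage allocation; the supplementary sequential segment is omitted, and the function name and return format are hypothetical.

```python
def classify_psu(first_stage_auspicious, k):
    """Classify a PSU from the outcomes of its c first-stage calls.

    first_stage_auspicious: list of c booleans (True = auspicious number).
    Returns (psu_type, n_second_stage), following the summary above:
    discarded if no call is auspicious, Type I if exactly one is,
    Type II if two or more are.
    """
    c = len(first_stage_auspicious)
    hits = sum(first_stage_auspicious)
    if hits == 0:
        return ("discarded", 0)
    if hits == 1:
        return ("Type I", k * c)
    return ("Type II", k * (c - 1))


# Illustrative values: c = 2 first-stage calls per PSU, k = 3.
none_hit = classify_psu([False, False], k=3)
one_hit = classify_psu([True, False], k=3)
two_hits = classify_psu([True, True], k=3)
```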

A large number of additional generalizations and modifi- 
cations to the basic Mitofsky-Waksberg method have been 
proposed. Many of these attempt to reduce the burden of 
interviewing screening and to improve control over the 
initial contact sample size. Thus Hogue and Chapman 
(1984) propose determining cutoff points on the basis of an 
estimation of the probability that a PSU is ‘sparse’, i.e., has 
a small proportion of residential numbers, and propose to 
determine an optimal cutoff procedure on the basis of cost 
and variance considerations. Alexander (1988) considers 
two types of cutoff rules to limit interviewing screening for 
prefixes with low residential densities. An ‘increasing rule’ 
stops as soon as a predetermined number of calls, $c_i$, has 
been made and fewer than i residences have been found, 
where $\{c_i\}$ is an increasing series in i. A ‘decreasing rule’ 
stops when i residences have been found if at least $c_i$ calls 
have been made, where $\{c_i\}$ is a decreasing series in i. The 
costs for these rules are evaluated under a simple model. 
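
A minimal sketch of the ‘increasing rule’ as worded above is given below. It is one reading of the rule, not Alexander’s own formulation: the cutoffs are supplied as a mapping from i to the corresponding number of calls, the rule is checked after every call, and the cutoff values and function names are hypothetical.

```python
def increasing_rule_stop(calls_made, residences_found, cutoffs):
    """True if, for some i, the cutoff number of calls has been reached
    while fewer than i residences have been found.
    cutoffs maps i to its cutoff, increasing in i."""
    return any(calls_made >= c_i and residences_found < i
               for i, c_i in cutoffs.items())


def screen_prefix(outcomes, cutoffs):
    """Dial numbers in order (True = residence) and return
    (calls, residences) at the point where screening stops."""
    calls = residences = 0
    for is_residence in outcomes:
        calls += 1
        residences += int(is_residence)
        if increasing_rule_stop(calls, residences, cutoffs):
            break
    return calls, residences


cutoffs = {1: 5, 2: 12}                     # hypothetical cutoff series
sparse = screen_prefix([False] * 20, cutoffs)
one_early = screen_prefix([True] + [False] * 19, cutoffs)
```

In the first case screening stops after five calls with no residence found; in the second, the single early residence carries the prefix past the first cutoff, and screening stops at the twelfth call.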

Lepkowski and Groves (1986a) propose a two phase 
design based on matching prefixes selected in the first stage 
of the Mitofsky-Waksberg scheme to a commercial direc- 
tory to obtain counts of listed telephones for each prefix 
selected. Prefixes are allocated to two strata — a low density 
stratum where there are no listed telephone numbers, or 
only a small number of them, and a high density stratum. 
The Mitofsky-Waksberg design is applied to the low- 
density stratum and telephone numbers are selected with 
probability proportional to the number of listed telephone 
numbers in the high-density stratum. 

Brick and Waksberg (1991) propose using a fixed 
number of telephone numbers in the second stage so as to 
avoid sequential sampling altogether with a resulting 
simplicity of operation. The design, originally proposed by 
Waksberg (1984), is not, however, self-weighting and 
involves a slight bias and increased variance. Brick and 
Waksberg (1991) suggest considerations for the choice 
between the original and modified Mitofsky-Waksberg 
designs. For an early application of the modified Mitofsky- 
Waksberg method to the collection of health attitude 
information, apparently in an erroneous attempt to 
implement the original method — see Cummings (1979). 
Smith and Frazier (1993) compare the original and 
modified schemes, using data collected in the California 
Behavioral Risk Factor Surveillance System. The results 
indicate that the modified scheme speeds up the data 
collection, resulting in a larger sample size for the same 
cost. This compensates for larger design effects of the 
modified scheme. 

Another alternative to the basic Mitofsky-Waksberg 
method is the use of stratification and disproportionate allo- 
cation to improve ‘hit rates’, proposed by Palit (1983). An 
evaluation of alternative treatments of unanswered tele- 
phone numbers for the Mitofsky-Waksberg design is 
carried out by Palit and Blair (1986). The optimal determi- 
nation of parameters for the Mitofsky-Waksberg method is 
dealt with by Burke, Morganstein and Schwartz (1981) and 
the optimal allocation for the stratified version of the 
method by Casady and Lepkowski (1991, 1993) and by 
Tucker, Casady and Lepkowski (1992). Further problems 
relating to minimal cost allocation are treated by Palit 
(1983) and by Mason and Immerman (1988). 


3.1.4 List-Assisted Methods 


Although RDD methods overcome the undercoverage of 
directories due to unlisted numbers, they all still suffer from 
the basic problem of undercoverage due to non-telephone 
households (see further detail in section 3.3). In addition the 
lack of auxiliary information (such as geographical infor- 
mation), which is often available in list frames, leads to 
inefficiencies, even in the more sophisticated modifications 
of the basic methods, mentioned above. Thus alternative 
methods have been sought to combine RDD samples with 
samples based on list and directory frames. One of the 
earliest attempts in this direction was that proposed by 
Stock (1962) and elaborated by Sudman (1973), based on 
replacing the last two digits of telephone numbers, selected 
from a directory listing, by random digits. The method was 
applied by Hauck and Cox (1974) to a methodological 
study of mode effects in screening for a special sub- 
population. A simpler version, popularly known as the ‘Plus 
One’ method, replaces each telephone number sampled 
from a directory by the number plus one (or some other 
digit, known as the ‘plus digit’ method). This supposedly 
overcomes the bias due to unlisted numbers. Due to its 
simplicity, the method has gained popularity among market 
researchers. However several studies — e.g., Landon and 
Banks (1977); and Mullet (1982) — have indicated that it is 
not, in fact, bias-free and also suffers from low efficiency. 
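
The ‘Plus One’ replacement is simple enough to state in a few lines. In the sketch below, the handling of a suffix that would overflow (wrapping within the last four digits so that the prefix is kept) is an assumption of the illustration, since the sources summarized here do not specify it; the function name and the example numbers are hypothetical.

```python
def plus_digit(number, increment=1):
    """Replace a directory-sampled number by the number plus `increment`,
    keeping the prefix fixed and wrapping the four-digit suffix."""
    prefix, suffix = number[:-4], int(number[-4:])
    return f"{prefix}{(suffix + increment) % 10_000:04d}"


listed = ["2125550123", "2125559999"]       # hypothetical listed numbers
dialed = [plus_digit(n) for n in listed]
```

The dialed numbers may be listed or unlisted, but their distribution still follows that of the listed numbers from which they are derived, which is one way to see why the method is not bias-free.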

Forsman and Danielsson (1997) propose a model-based 
approach for plus digit sampling, based on the assumption 
of randomly mixed listed and unlisted numbers within 
prefix. The model, which is tested empirically, provides 
model unbiased estimates. Ghosh (1984) has proposed an 
improved method that continues adding one to the last 
telephone number dialed as long as a household is not 
reached and stopping once a household is reached. 
Although still biased, the bias is reduced as compared with 
the simple ‘plus one’ method. Other list-assisted methods 
with RDD components are discussed by Potter, McNeill, 
Williams and Waitman (1991), who stratify prefixes 
according to counts of published telephone numbers, while 
ensuring inclusion of blocks without any published 
numbers. 

Brick, Waksberg, Kulp and Starer (1995) propose a 
list-assisted method that overcomes the troublesome 
problem of the sequential nature of the second stage 
sampling inherent in the Mitofsky-Waksberg scheme. The 
method is based on dividing the file of exchanges 
(100-banks) into two strata. The first consists of all 
exchanges with at least one listed residential phone and the 
second those that have none. Sampling only from the first 
stratum drastically reduces the proportion of nonresidential 
numbers which have to be dialed, but results in coverage 
bias. They investigate the bias and conclude that such 
truncated sampling methods are efficient and have 
operational advantages, while the resulting coverage bias 
(about 4%) is not too important. The method has been 
widely applied to replace the classical Mitofsky-Waksberg 
method. Similarly Statistics Canada has used the method 
for their General Social Survey since 1991 for the whole 
sample, with simple random sampling within banks of 
numbers identified as having at least one residential number 
(Norris and Paton 1991). Modifications of this design 
include a complete stratification of number banks on the 
basis of list information and using simple RDD for strata 
with small proportions of banks with no listing and the 
Mitofsky-Waksberg method in the remaining strata. A 
comparison of this design with other stratified designs 
based on a cost model is carried out by Casady and 
Lepkowski (1993). Their results show that for low cost 
ratios (of productive selections to unproductive selections) 
two and three stratum RDD designs are as efficient as the 
Mitofsky-Waksberg scheme and that for high cost ratios 
they are superior. 
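
The truncated list-assisted design described above, in which only 100-banks with at least one listed residential number are sampled, can be sketched as follows. The input format (a mapping from eight-digit bank to its count of listed residential numbers), the function name and the example counts are assumptions made for the illustration.

```python
import random


def truncated_list_assisted_sample(listed_counts, n, seed=None):
    """Draw n numbers by simple RDD restricted to the first stratum,
    i.e., 100-banks with at least one listed residential number.

    listed_counts: dict mapping an eight-digit bank to its listed count.
    """
    rng = random.Random(seed)
    eligible = sorted(b for b, cnt in listed_counts.items() if cnt >= 1)
    # All banks hold 100 numbers, so choosing a bank at random and then
    # two random final digits gives every number in the stratum equal chance.
    return [f"{rng.choice(eligible)}{rng.randrange(100):02d}"
            for _ in range(n)]


# Hypothetical listed counts per 100-bank.
counts = {"21255501": 37, "21255502": 0, "21255503": 4}
numbers = truncated_list_assisted_sample(counts, n=10, seed=2)
```

No second-stage sequential screening is needed, at the price of excluding households whose numbers lie in banks without any listing.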


3.1.5 Multiple Frame Designs 


In an attempt to overcome some of the inherent biases of 
telephone surveys due to directory and telephone under- 
coverage, the use of dual frame mixed mode surveys, 
combining telephone with face-to-face interviewing, has 
received increasing attention. These combine conventional 
samples for personal interview with RDD or directory 
samples for telephone interviewing. Biemer (1983) investi- 
gated the optimal mix for such designs, via a simulation 
study, and McCarthy and Bateman (1988) propose the use 
of mathematical programming for attaining optimal alloca- 
tion of sample units for a dual frame design, which allows 
posterior analysis of the effects of variations in design and 
cost parameters on the optimization. Choudhry (1989) 
proposes a cost-variable optimization for estimating 
proportions and Brick (1990) proposes the use of multi- 
plicity sampling for this purpose. In a series of papers, 
Groves and Lepkowski (1985, 1986); Lepkowski and 
Groves (1984, 1986b); and Traugott, Groves and 
Lepkowski (1987) develop error models for these dual 
frame survey designs. They also report on results of 
experiments to compare response rates and potential biases 
of RDD and list samples and of several interviewing 
methods. The results were applied to the large scale US 
National Crime Survey. 

Whitmore, Mason and Hartwell (1985) report on appli- 
cations of dual frame dual mode methods in a US Environ- 
mental Protection Agency sponsored study of personal 
exposure to carbon monoxide in two metropolitan areas and 
in a state-wide study of social service needs. In both cases 
commercially available directory lists were used in associa- 
tion with area household sampling. On the basis of an 
analysis of their results, they recommend the use of such 
dual designs in order to both benefit from the relative 
efficiency of telephone interviewing and to overcome the 
biases inherent in the use of directories as sampling frames. 
A combination of RDD and area sampling is reported by 
Waksberg, Brick, Shapiro, Flores-Cervantes and Bell 
(1997) for the US National Survey of America’s Families 
in which there was a particular focus on the low-income 
population. The nontelephone households identified in the 
area screening were given cellular phones for responding to 
telephone interviewers, thereby avoiding the need to train 
the area screener interviewers in a non-telephone 
questionnaire (Cunningham, Berlin, Meader, Molloy, 
Moore and Pajunen 1997). 


3.2 Other Sampling Issues 


3.2.1 Sampling for Special Populations 


The relatively low costs of telephone interviewing have 
made this survey mode a prime candidate for use in 
screening large samples in order to locate small special 
populations. Thus Sudman (1978) discusses the conditions 
under which the use of a telephone sample for screening a 
subgroup, to be finally interviewed face-to-face, is more 
efficient than face-to-face screening. An analysis of cost
functions shows telephone screening to be efficient,
unless within-cluster homogeneity is small, interview 
densities are low and/or location and screening costs are 
low, relative to interview costs. Blair and Czaja (1982) 
propose a modification of the Mitofsky-Waksberg proce- 
dure to locate special populations that cluster geograph- 
ically and describe an application to the Black population. 
As pointed out however by Waksberg (1983), their method 
requires reweighting when clusters are exhausted, which 
may result in reduced efficiency. This implies that the 
method may be efficient for the Black population but not 
necessarily for other minorities. Another telephone sample 
design targeting the US black population is proposed by 
Inglis, Groves and Heeringa (1987). Mohadjer (1988) 
proposes the stratification of prefix areas in an RDD design 
for sampling rare populations. The Mitofsky-
Waksberg method for selection of households, combined
with a stratified sample of individuals within household, is
used for the selection of a population-based control group
in four epidemiological studies reported by Hartge, Brinton, 
Rosenthal, Cahill, Hoover and Waksberg (1984). The 
effectiveness of the method is studied by Perneger, Myers, 
Klag and Whelton (1993), on the basis of a simulation of 
simple random sampling, and found to be effective. 

Local area surveys are another example of special 
populations that can be dealt with efficiently by a telephone 
survey. Although, in general, telephone exchanges do not 
define geographical areas exactly, there is a high degree of 
correspondence and, with some screening for those in the 
defined area, telephone interviewing can reduce costs 
considerably. For instance Banks and Hagan (1984) report 
on the reduction of interviewer screening by a combination 
of list sampling and RDD for a survey to assess the 
effectiveness of health programs in specific service areas. 
Similarly, Campbell and Palit (1988) tested a combination 
of list sampling and TDD — total digit dialing, using a frame 
of all numbers in exchanges corresponding to a given 
census area. They found that this resulted in a substantial 
saving in enumeration costs, versus face-to-face inter- 
viewing. 


3.2.2 Sampling Individuals Within Households 


Almost all household surveys include questions relating 
to individuals in the household. In some cases all individ- 
uals belonging to the household are included in the sample, 
but in many cases, for various reasons, a sample of one or
more individuals is selected within the household for 
individual questions. The classic Kish procedure (Kish 
1949), predominantly used in face-to-face interview surveys,
raises particular problems for telephone surveys, because it 
requires obtaining complete household listings over the 
telephone. This is more difficult to obtain over the phone 
than in a face-to-face interview, where some of the persons 
may be physically present. It should be pointed out however
that the information on household composition is often
required in any case. In addition the manipulation
of the selection rules by the interviewer (e.g., to accomplish 
high response rates), which has long been suspected in 
face-to-face interviewing, is almost impossible in CATI
surveys (where selection is invisible to the interviewer). 

Troldahl and Carter (1964) proposed a method whereby 
only the number of persons of each sex is required. 
Probabilistic rules (e.g., ‘oldest man’) are then applied to 
determine the individual selected, ensuring known selection 
probabilities for each person. However a positive probabi- 
lity of selection for each individual is not ensured (e.g., in 
households with three males the one of intermediate age is 
never selected). The method (known as the ‘Troldahl-Carter 
method’) has been modified by Bryant (1975), in order to 
take into account the possibility of households with more 
than two individuals of the same sex. An alternative method 
proposed by Salmon and Nichols (1983) and by O’Rourke
and Blair (1983) is to select the person with the next (or 
last) birthday (the ‘next-birthday’ or ‘last-birthday’ 
method), which ensures equal probability of selection for 
each household member, under the assumption that the date 
of interview is random. This is of course a reasonable 
assumption only for surveys carried out over a twelve- 
month period but not for surveys with shorter interview 
periods. This and other factors may lead to selection 
probabilities that are correlated with the individual charac- 
teristics. Another selection method proposed by Hagan and 
Meier (1983), which does not require any preliminary 
information on household composition, selects a pre-
defined person (e.g., ‘eldest man’). The method again fails
to ensure a positive probability of selection for each
household member. 
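
The point that rule-based selection can leave some household
members with zero selection probability can be checked by
direct enumeration. The sketch below is only illustrative: the
rule set, the function names and the example household are
hypothetical, and the rules are a simplification in the spirit
of the 'oldest man' type of rule, not the published
Troldahl-Carter selection matrices. It assumes that one
applicable rule is drawn with equal chance.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical, simplified rule set (NOT the published Troldahl-Carter
# selection matrices): each rule designates a sex and an age extreme.
RULES = [("M", "oldest"), ("M", "youngest"),
         ("F", "oldest"), ("F", "youngest")]


def select(household, rule):
    """Return the name picked by one rule, or None if nobody qualifies.

    household is a list of (name, sex, age) tuples.
    """
    sex, extreme = rule
    candidates = [p for p in household if p[1] == sex]
    if not candidates:
        return None
    pick = max if extreme == "oldest" else min
    return pick(candidates, key=lambda p: p[2])[0]


def selection_probabilities(household, rules=RULES):
    """Selection probability per member when one applicable rule is
    drawn with equal chance (an assumption made for illustration)."""
    picks = [select(household, r) for r in rules]
    picks = [name for name in picks if name is not None]
    counts = Counter(picks)
    return {name: Fraction(counts[name], len(picks))
            for name, _, _ in household}


# Example household with three males: the one of intermediate age
# ("elder_son") can never be designated by an oldest/youngest rule.
household = [("father", "M", 62), ("elder_son", "M", 35),
             ("younger_son", "M", 30), ("mother", "F", 60)]
probs = selection_probabilities(household)
for name, p in probs.items():
    print(name, p)
```

Under these assumptions the male of intermediate age receives
probability zero, while the single female is designated by two
of the four rules, so the probabilities are known but unequal.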

Several empirical comparisons of the above methods 
have been carried out. Czaja, Blair and Sebestik (1982) 
found no significant differences in response rates or in 
demographic profiles between two versions of the Troldahl- 
Carter method and the Kish method. Hagan and Meier 
(1983) compare their method, described above, with the 
Troldahl-Carter method and find that the method they 
propose has a significantly lower refusal rate, with no 
significant differences in demographic profiles. Salmon and 
Nichols (1983) compare four procedures for selecting 
respondents within a household unit — Troldahl-Carter, 
male/female alternation, next-birthday and no-selection 
methods — in a small telephone survey. They reach the 
conclusion that the next-birthday method is a relatively 
efficient procedure for selecting a sample that is represen-
tative of all household members. Oldendick, Bishop,
Sorenson and Tuchfarber (1988) find no significant 
differences between the Kish method and the last-birthday 
method. In a study using the last-birthday method, Romuald
and Haggard (1994) find that informants self-select to 
participate at a higher rate than expected. They investigate 
the effect of using memory cues on respondent self- 
selection and reach the conclusion that there is no 
significant effect. Lavrakas, Bauman and Merkle (1993) 
evaluate the effect of the use of the last-birthday method on 
within-unit coverage in a national survey and report 
evidence to suggest that the method leads to incorrect 
selection in many cases. Forsman (1993) reviews expe-
riences of within-household sampling for 18 private opinion
research companies and reports on a test to compare the
Kish, next/last birthday and the Troldahl-Carter methods.
The study concludes that the Troldahl-Carter method is somewhat
better than the Kish method and that both are superior to the 
birthday methods. Similarly, Binson, Canchola and Catania 
(2000) report on a three-way comparison in a national 
telephone survey between the Kish, next-birthday, and 
last-birthday methods, and find significant differences 
between the three methods in the dropout rate, during the 
initial stages of the screening process. The Kish method had 
the highest dropout rates and the ‘next-birthday’ had the 
lowest rate. They conjecture that interviewers, rather than 
respondents, are a primary source of the higher rate of 
refusals when using the Kish method, due to the fact that a 
full household roster is required. 


3.3 Coverage and Nonresponse


3.3.1 Telephone Coverage


The problem of telephone noncoverage was until very 
recently a major drawback of telephone surveys. Even in 
the US overall person undercoverage (in nontelephone 
households) remained at 7.2% by the end of 1986 — 
Thornberry and Massey (1988). By the mid-eighties 
household telephone undercoverage was less than 10% in 
most Western countries, with the highest coverage (99%) in 
Sweden. But some countries still had high rates of 
telephone undercoverage, for instance: UK 25%, Italy 29%,
Ireland 50%, Israel 30% — Trewin and Lee (1988). The 
situation changed dramatically towards the end of the 
century, with most Western countries reaching virtual 
saturation. Telephone coverage reached 94.4% in the US in 
1999 (NTIA 2000); 96.6% in Australia in 1996 (St. Clair 
and Muir 1997); 97.0% in the UK (OFTEL 1999); 97.3% in 
Israel (Central Bureau of Statistics 2000); 97.9% in Finland 
(Kuusela and Vikki 1999); 98.2% in Canada (Statistics 
Canada 1999); and 99% in Germany (Federal Republic of 
Germany 1999). 

Obviously the major problem of telephone undercov- 
erage lies primarily in differential undercoverage rather 
than in its overall rate and in the fact that telephone under-
coverage is highly correlated with a wide range of
demographic, economic and health variables. This has been 
demonstrated extensively in a large number of empirical 
studies in the US and elsewhere — see for instance Groves 
and Kahn (1979), Collins (1983, 1999), Thornberry and 
Massey (1983, 1988), Trewin and Lee (1988) and Botman 
and Allen (1990). The rapid increase in overall telephone 
coverage over the last decade has not caused any radical 
change in this situation. Thus in Finland, with an overall 
telephone undercoverage of 2.1% in 1999, low income 
households (less than 675 Euros per month) had an 
undercoverage of 11.3% (vs. 0% for high income groups) 
and those living in rented accommodation 4.9% (Kuusela 
and Vikki 1999). In Israel telephone undercoverage was 
17.9% for the lowest income decile as against 0.8% for the 
two highest deciles and 24.9% for single adult households 
with three or more children as against 2.4% for childless 
households with three or more adults (Central Bureau of 
Statistics 2000). Similarly in the US large geographical 
variations are still found and telephone undercoverage is 
found to correlate with housing deficiencies, race, educa-
tion, income and mobility (Shapiro, Battaglia, Hoaglin,
Buckley and Massey 1996; Giesbrecht, Kulp and Starer 
1996; Fox and Riley 1996; NTIA 2000). Health-related
characteristics were found to differ somewhat between 
persons in telephone and non-telephone households in the 
National Health Interview Survey by Anderson, Nelson and 
Wilson (1998) and in the National Health and Nutrition 
Examination Survey by Ford (1998). However telephone 
coverage effects were considered to be minor in both 
studies. 

However the main problem of telephone coverage 
foreseen for the near future relates to the introduction and 
rapid proliferation of mobile telephones. In the late nineties
the proportion of households with access to at least one 
mobile telephone reached 76% in Finland, 59% in 
Denmark, 35% in Italy (Rouquette 2000) and 52% in Israel 
(Central Bureau of Statistics 2000). If all these mobile 
telephones were additional to fixed line telephones no 
problem would arise. However there are already strong 
indications of a tendency in several countries to consider 
the mobile telephone as an alternative to a fixed line tele- 
phone, rather than a supplement. Kuusela and Vikki (1999) 
report that 20% of Finnish households now have exclusi- 
vely one or more mobile telephones and no fixed line and 
predict that within a year the number of mobile phones will 
exceed the number of fixed lines. Similar figures for the UK 
are 3% (OFTEL 2000) and for Israel 2.9% (Central Bureau 
of Statistics 2000). This implies that fixed line telephone 
coverage is down to 77% in Finland and to 94% in the UK 
and in Israel. In Germany it is estimated that the percentage 
of households with fixed line telephones will decrease to 
92% by 2004 (Gabler and Haeder 2000). Furthermore the 
characteristics of persons with only mobile telephones are 
quite different from those with fixed telephone lines. In 
Finland, according to Kuusela and Vikki (1999), they tend
to be young, often living alone in rented apartments in
urban areas. It should be noted that the transfer from fixed 
phone lines to mobile telephones is apparently not occur- 
ring to any large extent in North America, due to diffe- 
rences in pricing strategies. 
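
The fixed line coverage figures quoted above for Finland, the
UK and Israel appear to follow from subtracting the share of
mobile-only households from the overall telephone coverage.
The short sketch below reproduces that arithmetic. It assumes
that the overall coverage rates cited earlier include
mobile-only households and that the two sets of figures refer
to comparable periods; the variable names are illustrative.

```python
# Overall telephone coverage and mobile-only share, in percent of
# households, as quoted in the text for each country.
coverage = {"Finland": 97.9, "UK": 97.0, "Israel": 97.3}
mobile_only = {"Finland": 20.0, "UK": 3.0, "Israel": 2.9}

# Assumption: overall coverage counts mobile-only households, so the
# fixed line coverage is the difference of the two percentages.
fixed_line = {country: round(coverage[country] - mobile_only[country], 1)
              for country in coverage}

for country, pct in fixed_line.items():
    print(f"{country}: about {pct}% of households with a fixed line")
```

The differences agree, up to rounding, with the approximate
figures of 77% and 94% given in the text.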

Theoretically RDD sampling could be extended to 
mobile telephones. In practice, this may be quite difficult 
due to the fact that mobile telephones are by nature a 
personal appliance, rather than a household one. Sampling 
persons within a household, via a mobile telephone contact 
with one of the members, is well nigh impossible. Inter- 
viewing via a mobile telephone of individuals who may be 
anywhere is also extremely difficult. Even the determina- 
tion of the total number of telephone numbers (mobile and 
fixed line) available to a household (required for weighting) 
may be daunting. We consider some possible approaches to 
these and other problems of the move to mobile telephones 
in section four. 

Undercoverage of persons within covered households 
relates primarily to the method of selection for individuals 
within the household — see section 3.2.2 — and to the under- 
coverage due to the failure to obtain complete listings of 
individuals in the households. The latter effect is investi- 
gated by Maklan and Waksberg (1988), by comparing data 
on individuals obtained from an RDD survey with those 
obtained from the US Current Population Survey and from 
the population census. They find that while mean household 
sizes are comparable, the RDD results are skewed towards 
two-person households and away from one-person house- 
holds. Some of the difference could be attributed to 
different residence rules, but the results do not indicate 
undercoverage of persons in the RDD survey. They also 
report on an experiment in which more detailed questions 
were asked on household composition and found practically 
no improvement in accuracy of reporting. In a similar 
experiment, carried out by Bercini and Massey (1979), the 
effects of the use of names in the household roster and the 
position of the question on the household roster (before or 
after the first interview) were tested in a survey on smoking. 
They found that both the use of names and the position of 
the household roster had an effect on response and that 
obtaining the roster after the interview without names is 
optimal. 


3.3.2 Nonresponse 


The problem of nonresponse and the biases associated 
with nonresponse is basic to all survey research, but there 
are some specific issues of nonresponse associated with 
telephone interviewing. One of the major problems is the 
ambiguity of the results of many attempts at dialing — e.g., 
continually engaged or no reply, numbers connected to fax 
machines, computer modems or answering machines. 
Recently automated screening devices have been developed 
to identify telephone numbers connected to recordings
indicating that they are not in service (Casady and
Lepkowski 1999). Thus proprietary hardware and software
have been developed to detect “tri-tone” recording which
indicates “not-in-service” and these numbers when dialed 
can be removed from the sample. Prior removal of many 
business phones can be carried out by matching with 
“Yellow Page” files. These and other methods reduce the 
costs of screening and the ambiguity of calls that 
continually receive no reply. 

Technological advances, such as “call forwarding” and 
caller identification enhance the possibilities for non- 
response. In addition refusals are easier over the phone than 
in face-to-face interviews and breaking off the interview in 
its midst is also easier. These and other problems of 
nonresponse for ‘cold’ telephone interviewing and the US 
experience in dealing with them are reviewed extensively 
by Groves and Lyberg (1988a). In particular they follow 
CASRO (1982) and White (1983) in recommending a 
definition of nonresponse rates which includes in the 
denominator an estimate of the number of unanswered 
numbers that are working numbers in addition to the 
complete and incomplete interviews, refused eligible 
numbers and other noninterviewed units. The estimate of 
the proportion of unanswered numbers that are eligible is 
obtained as the proportion of answered numbers that are 
eligible. However this may be a biased estimator. For 
instance the intensive use of answering technology by 
businesses implies that practically all businesses will 
respond and can be identified as businesses. Also, as 
pointed out by Massey (1995), this measure has to be 
modified in the case of screening by defining a household 
screening response rate as the estimated proportion of 
eligible households identified as such by the screening, 
rather than the proportion of all households screened for 
eligibility. Cunningham, Brick and Meader (2000) present 
several detailed measures of response rates and eligibility 
rates for each stage of a survey with screening, as well as 
overall rates, in reporting on the methodology of the 
National Survey of America’s Families. 
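
The response rate definition described above, in which the
denominator includes an estimate of the eligible cases among
unanswered numbers, can be sketched as follows. This is a
minimal illustration under stated assumptions: the function
name, the argument names and the counts are hypothetical, and
taking completed interviews alone as the numerator is an
assumption rather than a detail given in the text.

```python
def response_rate(completes, partials, refusals, other_noninterviews,
                  unanswered, answered_eligible, answered_total):
    """Response rate with estimated eligibility among unanswered numbers.

    The eligibility proportion among unanswered numbers is estimated
    by the proportion of answered numbers that are eligible, as
    described in the text. Numerator = completed interviews (assumed).
    """
    if answered_total <= 0:
        raise ValueError("answered_total must be positive")
    eligibility = answered_eligible / answered_total
    denominator = (completes + partials + refusals + other_noninterviews
                   + eligibility * unanswered)
    return completes / denominator


# Hypothetical counts: 1500 answered numbers, of which 900 are eligible
# (600 completes, 50 partials, 200 refusals, 50 other), plus 400
# numbers that were never answered.
rate = response_rate(completes=600, partials=50, refusals=200,
                     other_noninterviews=50, unanswered=400,
                     answered_eligible=900, answered_total=1500)
print(f"response rate: {rate:.3f}")
print(f"nonresponse rate: {1 - rate:.3f}")
```

If businesses are more likely than households to answer, the
eligibility proportion among answered numbers understates that
among unanswered numbers, which is the source of the possible
bias noted in the text.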

Telephone nonresponse rates are, in general, higher than 
those obtained from face-to-face interviews, due to the 
reasons mentioned above — see Hochstim (1967), Groves 
and Kahn (1979), Fitti (1979), Groves and Lyberg (1988a) 
for US experience; Wilson, Blackshaw and Norris (1988), 
and Collins, Sykes, Wilson and Blackshaw (1988) for 
experience in UK surveys; and Drew, Choudhry and Hunter
(1988) for the experience of Canadian government surveys. 
The latter includes also comparisons of ‘cold’ and ‘warm’ 
telephone interviews, which show only small differences in 
nonresponse rates. More recently an analysis of the 
experience in 39 US telephone surveys carried out in the 
nineties (Massey, O’Connor and Krotki 1997) indicates a 
slight further reduction in response rates to an average of 
62% and a range from 42% to 79% (though it seems that 
Canadian response rates have not decreased over recent 
years). Among the factors to which this increase in 
nonresponse can be attributed are the increase in the use of 
technological devices (answering machines, call 
forwarding, multi-purpose telephone lines) and the
increased prevalence of telephone solicitation, already 
identified as a potential problem for telephone surveys by 
Biel (1967). The American Statistical Association (1999)
considers the effect of near saturation calling conducted by 
telemarketers on lowering survey cooperation rates as a 
serious challenge not fully addressed by survey researchers.
It concludes that unless the trend can be reversed, 
“telephone surveys, as we know them, could disappear 
within the next five years”. A similar view is expressed by 
Kalton (2000). 

As is the case for telephone noncoverage, the effect of 
nonresponse on biases in survey estimates is made more 
severe by the correlation between nonresponse and many 
socio-economic characteristics. Groves and Lyberg (1988a) 
on the basis of a review of previous work identify the main 
correlates of telephone nonresponse. They are age (elderly 
persons have higher refusal rates — see also Collins et al. 
1988) and education (higher nonresponse among lower 
education groups - see, e.g., Cannell, Groves, Magilavy,
Mathiowetz, Miller and Thornberry 1987). On the other 
hand, there is evidence showing that urban-rural differences 
in nonresponse are diminished in telephone surveys, as 
compared with face-to-face surveys — Groves and Kahn 
(1979). More recent papers on the effects of nonresponse 
concentrate on specific issues. Thus Diehr, Koepsell, 
Cheadle and Psaty (1992) investigate the relationship of 
response rate and other summary variables at the prefix and 
at the person level. They find relationships between 
nonresponse and age, race and family size and type. 
Merkle, Bauman and Lavrakas (1993) in an investigation of 
the impact of callbacks on the quality of survey estimates 
show that age and employment status are the major 
correlates with the number of callbacks required. Kalsbeek 
and Durham (1994) investigate the effect of nonresponse in 
a follow-up telephone survey on breastfeeding among 
low-income women and find that the main correlates with 
nonresponse are age and degree of urbanization. Finally, 
multilevel modeling is applied to an extensive meta-analysis 
of reports on inter-mode comparisons of nonresponse by 
Hox, de Leeuw and Kreft (1991). The results, based on the
analysis by multi-level modeling of a total of 45 studies (35 
of which included a telephone component), indicate 
significantly lower response for telephone studies than for 
face-to-face studies when models with fixed slopes are 
used. However when random-slope models are used the 
difference is no longer significant. 

In attempts to reduce nonresponse in telephone surveys 
the effect of survey operational variables on nonresponse 
has been investigated. Thus Sebold (1988) finds that 
doubling the survey period (from two to four weeks) 
increased the response rate by 3 percentage points in an 
experiment for the US National Crime Survey. Brick and 
Collins (1997) investigated the effect of advance letters and 
screening questions on response in the US National 
Household Education Survey. They found that a screen-out 
question approach increased response rates considerably
but that the advance letter did not add to the effect of 
screening. Other survey variables that have been found to 
affect response rates are interview length (Collins et al.
1988) and interviewer vocal characteristics (Oksenberg and
Cannell 1988). The effect of the method of selection of
sample individuals on nonresponse (in particular the 
requirement for household rosters) has already been 
mentioned in section 3.2.2. 

Finally, in recent years there has been a significant 
increase in the use of answering machines and caller ID 
devices for screening unwanted calls, with obvious 
increased potential for nonresponse. For instance, the 
proportion of households with answering machines in 
France increased from 21% in 1995 to 40% in 1999 
(Rouquette 2000), the same as in Germany (Federal 
Republic of Germany 1999), while in the US the proportion 
increased from about 25% in 1988 (Tuckel and Feinberg 
1991) to over 73% by 1997 (Decision Analyst 1997). 
However, based on a nationwide telephone survey, Tuckel 
and Feinberg (1991) reach the conclusion that, in 
comparison to other initial non-contact groups (e.g., ‘no
answer’ or ‘busy’), those with answering machines are 
more likely to respond and less likely to refuse, resulting in 
a contact rate which is definitely not smaller than that of 
other non-contacts. In fact, it seems, according to a study by 
Oldendick and Link (1994), that the use of answering 
machines to screen out survey calls is limited to some 2-3 
percent. However screeners tend to be in higher income 
groups, urban and with higher education. Similarly, Piazza 
(1993) finds on the basis of extensive data from the 
California Disability Survey, a telephone survey with a high 
number of callbacks, that although answering machine 
owners are more difficult to contact initially, once contacted 
they are at least as likely to respond as those without 
answering machines. The author also points out that reaching an
answering machine ensures that a household has been 
reached and that its residents do not want to miss important 
calls. In a study by Xu, Bates and Schweitzer (1993), 
designed to investigate the effect of leaving messages on 
answering machines, households with answering machines 
were found to be more likely to be contacted and to 
complete the interview than those without answering 
machines. Furthermore leaving a message on the answering 
machine led to a significant increase in response rate and 
reduction in refusals. Similarly, Harlow, Crea, East, Oleson, 
Fraer and Cramer (1993), based on results of a controlled 
experiment, found that leaving a message on the answering 
machine led to an increase of 15% in response, after 
adjusting for age, interviewer and town of residence. 
Koepsell, McGuire, Longstreth, Nelson and van Belle 
(1996) carried out a randomized trial of leaving messages 
on answering machines and found an overall increase of 
20% in response rate. Although in a similar study Tuckel 
and Shukers (1997) found no significant effect, the overall 
findings in a range of studies indicate that the increase in
the use of answering machines has a beneficial effect on
survey response, probably due to their providing the 
possibility of leaving positive messages and thereby 
enabling the screening out of telemarketing calls. 

Tuckel and O’Neill (1996) estimate that the percentage
of US households with caller ID increased from 3% in 1992 
to 10% in 1996. Based on a national study, in which the 
profiles of both caller ID subscribers and answering 
machine owners are analyzed, they reach the conclusion 
that these technological devices do not yet present major 
obstacles for telephone survey research, since their owners 
tend to use the screening devices primarily to screen out 
recognized undesirable numbers of acquaintances rather 
than unrecognized numbers. However, they point out that 
the possibility of screening will probably lead to increases 
in answering machine response to repeated callbacks. 


3.3.3 Weighting and Adjustment


Telephone surveys often require special attention to 
weighting and adjustment. Although sampling designs are 
usually based on equal probabilities of selection, in practice 
these are not always achieved. For instance RDD sample 
designs are theoretically self-weighting but in fact unequal 
selection probabilities may result due to the multiplicity of 
telephone lines (numbers) for the same household. In this 
case, if information is collected on the number of telephone 
lines to which the household is connected, the required 
adjustment is straightforward. Similarly reweighting is 
required to take into account PSUs for which the number
of in-scope numbers is less than the required cluster sample
size. An additional problem arises due to the fact that it is 
often difficult to determine whether a telephone, from 
which no answer can be obtained after repeated attempts, is 
indeed a case of in-scope nonresponse or is, in fact, 
out-of-scope. Other problems requiring reweighting are 
nonresponse, the inherent undercoverage due to non- 
telephone households and the obvious necessity to use some 
form of multiplicity estimator for multiple-frame sample 
designs, based on information on the frames on which the 
unit is represented. 
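
The adjustment for multiple telephone lines mentioned above
can be illustrated with a short sketch. It assumes that each
residential number has the same chance of selection, so that a
household reachable on k lines is k times as likely to be
sampled and its weight is divided by k. The function names and
the example data are hypothetical.

```python
def line_adjusted_weights(base_weight, lines_per_household):
    """Divide a common base weight by each household's number of lines.

    Assumes equal selection probability per telephone number, so the
    household selection probability is proportional to its line count.
    """
    weights = []
    for n_lines in lines_per_household:
        if n_lines < 1:
            raise ValueError("a sampled household has at least one line")
        weights.append(base_weight / n_lines)
    return weights


def weighted_mean(values, weights):
    """Weighted mean of the survey variable."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical responses: number of lines reported by each sampled
# household and a 0/1 survey variable.
lines = [1, 2, 1, 4]
y = [0, 1, 0, 1]
weights = line_adjusted_weights(base_weight=1000.0,
                                lines_per_household=lines)

print("unadjusted mean:", sum(y) / len(y))
print("line-adjusted mean:", weighted_mean(y, weights))
```

In this example the households with several lines, which are
over-represented in the sample, are weighted down, so the
adjusted mean is lower than the unadjusted one.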

These problems are dealt with for national RDD samples 
carried out by the US National Center for Health Statistics 
in a series of papers by Thornberry and Massey (1978); 
Botman, Massey and Shimizu (1982); and Massey and 
Botman (1988). They describe the weighting adjustments 
carried out for the RDD US National Health Interview 
Survey (NHIS) and for a smoking survey to account for 
multiple telephones per household, for telephone coverage 
and for nonresponse. The adjustments were based on 
external data for race and geographic region and on survey 
information on nonresponse and on multiple telephones. 
Several alternative adjustment and weighting procedures 
are compared and evaluated. Chapman and Roman (1985) 
compare substitution with nonresponse adjustment in a 
feasibility study for the RDD NHIS and report that the 
results with respect to bias and variance are similar. Drew 
and Groves (1989) compare alternative adjustment proce-
dures for unit nonresponse based on external administrative
data, on an explicit response prediction model and on 
response probabilities estimated on the basis of callback 
data. Casady and Sirken (1980) propose a multiplicity 
estimator for a multiple-frame sampling design applied to 
data from the US National Health Interview Survey. Brick 
(1990) compares the multiplicity estimator with the tradi- 
tional multiple frame estimator for an educational RDD 
survey. 

Goksel, Judkins and Mosher (1991) report on adjust- 
ments, based on modeling nonresponse propensities, for a 
telephone follow-up of a face-to-face interview in the US
National Survey of Family Growth. Adjustments based on
response propensities by intensity of follow-up effort and 
by smoking status are proposed for a Canadian survey of 
attitudes to smoking restrictive legislation by Bull, 
Pederson and Ashley (1988). 

Following a comparison by Keeter (1995) of non-
telephone households with ‘transient’ households (those
who recently gained or lost telephone service), Brick, 
Waksberg and Keeter (1996) propose the use of data on 
interruptions in telephone service in order to adjust for the 
undercoverage due to non-telephone households. Their 
results indicate that such adjustment can lead to a reduction 
of mean square error. Hoaglin and Battaglia (1996) 
compare a modified poststratification method and a 
model-based estimation with simple poststratification for 
adjusting for noncoverage in an RDD survey of vaccination 
coverage. The modified poststratification uses national data 
on vaccination rates for telephone and non-telephone 
children in addition to demographic and socioeconomic 
data used for simple poststratification, while the model-
based adjustment is based on a logit model to estimate the
probability of residing in a telephone household. The results 
show gains from the use of the modified poststratification 
but only slight differences between the modified post- 
stratification and the model based adjustment. A similar 
adjustment based on telephone interruption data is applied 
by Frankel, Srinath, Battaglia, Hoaglin, Wright and Smith 
(1999) to NHIS data and shows conclusively a substantial 
reduction in bias. 
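
As a point of reference for the comparisons above, simple
poststratification can be sketched as a rescaling of weights
within cells so that weighted sample totals match external
population totals. The sketch below is a generic illustration
only: the cell labels, totals and function name are
hypothetical, and it does not reproduce the modified or
model-based procedures of the studies cited.

```python
def poststratify(weights, cells, population_totals):
    """Rescale weights so that each cell's weighted total matches the
    known population total for that cell (simple poststratification)."""
    cell_sums = {}
    for w, c in zip(weights, cells):
        cell_sums[c] = cell_sums.get(c, 0.0) + w
    return [w * population_totals[c] / cell_sums[c]
            for w, c in zip(weights, cells)]


# Hypothetical sample: four respondents in two demographic cells,
# with equal design weights before adjustment.
design_weights = [100.0, 100.0, 100.0, 100.0]
cells = ["cell_a", "cell_a", "cell_b", "cell_b"]
population_totals = {"cell_a": 300.0, "cell_b": 100.0}

adjusted = poststratify(design_weights, cells, population_totals)
print(adjusted)
```

After adjustment the weights in each cell sum to the external
total for that cell. Such an adjustment reduces noncoverage
bias only to the extent that telephone and non-telephone
households are similar within cells.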


3.4 Data Quality - Response Errors and Mode Effects


The quality of information obtained over the telephone 
has always been a controversial issue. As mentioned in 
section 2, apprehensions on the supposed inferiority of the 
quality of data from telephone interviewing were allayed at 
an early stage, to a large degree by some of the extensive 
empirical appraisals carried out in the sixties and seventies. 
However there was still some conflicting evidence from 
different studies on the relative quality of telephone and 
face-to-face interviewing. Although the intensive analysis 
of large omnibus surveys carried out under the two modes 
by the University of Michigan Survey Research Center 
(Groves and Kahn 1979) provided important information
on data quality and other issues, the mode comparisons and 
a comparison with external data were not conclusive. In an 
attempt to resolve the issue, de Leeuw and van der Zouwen
(1988) carried out an extensive meta-analysis of 28 major 
empirical studies in which comparisons of face-to-face and 
telephone interviewing were investigated. The studies, 
carried out between 1952 and 1986 on a variety of topics, 
were primarily from the US but some European studies 
were also covered. Data quality indicators used were 
response validity (based on validation studies), absence of 
social desirability bias, item response, amount of infor- 
mation (for open questions or check-lists) and similarity of 
response. The overall finding is that if there are any diffe- 
rences in quality between the two modes, they are definitely 
very minor and that other considerations, such as costs and 
convenience, should be used in decisions on the use of the 
telephone for survey work. Similar conclusions are reached 
for the UK by Sykes and Collins (1988), on the basis of 
four comparative studies; for income data in Denmark by 
Körmendi (1988), in a validation study based on 
administrative data; and in a comparison of financial data in a 
Canadian Farm Financial Survey (Caron and Lavallée 
1998). 

Other recent studies on mode effects concentrate on 
specific issues and topics but reach similar conclusions. 
Thus Herzog and Rodgers (1988) report on a mode compa- 
rison in a study of older adults and find only small 
differences. Similar results are reported by Foley and Brook 
(1990) for a survey on the last days of life. In a study of the 
sensitive topic of drug use Aquilino and Lo Sciuto (1990) 
find almost identical results for whites, but some significant 
differences for blacks, even after controlling for variables 
possibly related to telephone undercoverage. This may be 
explained by results reported by Johnson, Fendrich, 
Shaligram and Garey (1997) for a telephone survey of drug 
use, which supports a social distance model of interviewer 
effects. 

There is little doubt that interviewers have a great effect 
on quality, both in face-to-face and in telephone surveys. 
The use of central telephone interviewing facilities provides 
more opportunities to control and monitor interviewer 
effects than in field interviewing. Some of the issues 
involved are treated by Stokes and Yeh (1988), who 
propose a Bayesian model for interviewer effects and 
methods for estimating the model parameters. A beta- 
binomial model for the interviewer variance component and 
methods of estimation of its parameters are proposed by 
Pannekoek (1988). 

An effective way of reducing response errors in 
face-to-face interview surveys has been the use of records 
provided by the respondent to verify and recall information 
on income, insurance, health events etc. Obviously, the 
extension of this method to telephone interviewing involves 
some problems, since the interviewer cannot see the 
documents and even asking the respondent to get them may 
involve a disruptive break in the telephone interview more 
frequently than in a face-to-face interview. However the use 
of records by respondents in telephone surveys can help to 
reduce response bias. Battaglia, Shapiro and Zell (1996) 
report on an attempt to ask respondents to use vaccination 
records in one of the rounds of the US National 
Immunization Survey and to compare the information 
obtained with provider records. Some 47% of the respon- 
dents did in fact use vaccination records but substantial 
underreporting bias was still found, possibly due to the fact 
that the vaccination reports were not always up to date. 
Similar effects are found in face-to-face surveys — see 
Brick, Kalton, Nixon, Givens and Ezzati-Rice (2000). 


4. CURRENT AND FUTURE TECHNOLOGICAL 
DEVELOPMENTS 


Together with almost complete telephone coverage, the 
very intensive technological development and the diversity 
of communications possibilities are continuously opening 
up new opportunities and potentials for using novel 
communication options for survey work. On the other hand, 
some of these developments may cause difficulties for 
telesurveys under the conventional methodology of today. 
Thus the increased sophistication of filtering devices and 
algorithms (as a development of the simple answering 
machines and caller ID devices mentioned in section 3.3) 
may make it easier than ever for respondents not to 
cooperate. In the following we examine present applications 
and conjectured future developments and comment on the 
methodological problems involved in their use. 


4.1 E-Mail and Web Surveys 


Internet access for households has experienced a very 
rapid increase in recent years. For instance in the US the 
proportion of households with access to the Internet has 
risen from 26% in December 1998 to 42% in August 2000 
— NTIA (2000). Other countries have reached somewhat 
lower levels — the UK 28% (in August, 2000 — OFTEL 
2000), Canada 25%, Finland 22%, France 7% and Belgium 
5% in 1999, according to Rouquette (2000), Israel 12% (in 
1999 — Central Bureau of Statistics 2000) and Germany 
11% (Federal Republic of Germany 1999). Despite this rapid 
increase, coverage is still far from complete. 
Furthermore, there are also some indications that, 
together with the increase in total use, there is also a 
growing category of ex-users. Katz and Aspden (1998) 
report that the proportion of former users of the Internet 
increased from 8% to 11% between 1995 and 1996. 
However the overall increase in access has encouraged the 
use of e-mail and the Internet for survey work. While 
coverage for an e-mail survey (EMS) is comparable to that 
of a Web (or Internet) survey and both are based on the use 
of a computer self-administered questionnaire (CSAQ), 
there is a basic difference between these two types of 
telesurveys. The e-mail survey is very similar to a mail 
survey, in that it is based on sending out a text questionnaire 
and asking the respondent to send back the completed 
questionnaire. The advantage over the mail survey is in cost 
and in the ease and simplicity of transmission and receipt. 
The Web survey is, in general, based on interaction 
between the respondent and the survey instrument, via the 
use of Java, XML, or a similar instrument. It allows 
multiple enhancements, such as colour and animation, and 
extensive possibilities for sophisticated skip patterns and 
real-time editing. The exciting potential for innovative 
collection systems based on ever-developing Web tools 
cannot yet overcome the basic problem inherent in both 
e-mail and Web surveys that current coverage is completely 
inadequate for most human populations of interest (Dillman 
2000). 

Nonetheless, e-mail and Internet surveys can be and are 
being used, with varying degrees of success, for certain 
populations where coverage is virtually complete or in 
conjunction with other modes of collection. Thus Couper, 
Blair, and Triplett (1999) report on an experimental study 
comparing e-mail and regular mail for a survey of 
employees in several U.S. government statistical agencies. 
The sampled employees were randomly assigned to a mail 
or e-mail mode of data collection and comparable proce- 
dures were used for advance contact and follow-up of 
subjects across modes. The results indicated somewhat 
higher response rates for mail than for e-mail, but data 
quality (item missing data) was similar across the two 
modes. In field tests for the 1999 US National Study of 
Postsecondary Faculty both administrators and faculty were 
offered the choice between completing and mailing a 
conventional paper questionnaire or completing a CSAQ 
via the Web (Abraham, Steiger and Sullivan 1998). 
Although it may be assumed that practically all respondents 
had access to the Web, only 8% of responding faculty and 
17% of the institution administrators opted for the CSAQ 
mode. The US National Science Foundation is planning to 
use a Web-based option in its 1999 National Survey of 
Recent College Graduates, under the hypothesis that most 
of the survey population would be relatively computer 
literate and have access to the Web (Meeks, Lanier, Fecso 
and Collins 1998). For a review of the use of CSAQ by 
government agencies and private survey organizations and 
the problems involved, see Ramos, Sedivi and Sweet 
(1998). 

However, most current Web surveys of general 
populations are based on non-probability sampling — mostly 
by some form of self-selection. Fischbacher, Chappel, 
Edwards and Summerton (1999) report on a meta-analysis 
of 28 surveys in the health field using e-mail and the 
Internet. Many of these were epidemiological studies aimed 
at patients of specific diseases and the problem of selection 
bias meant that most of the results could not be generalized. 
One of the largest Web surveys is the WWW User Survey 
carried out by the Graphics Visualization and Usability 
Center at Georgia Institute of Technology (Kehoe, Pitkow, 
Sutton, Aggarwal and Rogers 1999). Although the survey 
population is defined as Internet users, the lack of any 
sampling frame for this population implies that 
respondents had to be solicited by various methods (Web 
and other media announcements, advertising banners, 
incentive cash prizes etc.), rather than sampled with known 
probabilities. Although some 20,000 users participated, the 
survey report points out that the data is biased towards 
experienced and more frequent users and recommends the 
augmentation of their data with random sample surveys. In 
an attempt to overcome the bias inherent in basing surveys 
on samples of those with internet access only, some 
commercial survey organizations distribute devices, which 
let users access the Internet through television sets, to all of 
their panelists in an RDD sample, to ensure consistent results 
(Felson 2001). However Poynter (2000) predicts that by the 
year 2005 95% of market research surveys will be 
conducted via the internet but that 80% will be based on 
respondents who have ‘opted in’, rather than on probability 
sampling. 

On the other hand, there is evidence that Web-based data 
collection can be applied with relative success for 
establishment surveys. Nusser and Thompson (1998) report on its 
use for the US Department of Agriculture’s National 
Resources Inventory Surveys; Rosen, Manning and Harrel 
(1998) on Web-based collection from establishments for the 
US Current Employment Statistics Survey and Meeks et al. 
(1998) on its use for data collection from academic 
institutions, federal agencies and private corporations for 
US National Science Foundation surveys. Assuming that 
the problem of coverage and sampling will eventually be 
resolved for households and individuals, this holds hope for 
Web-based collection for household surveys at some point 
in the future. 


4.2 Other Computer Self Administered 
Questionnaire (CSAQ) and Computer 
Assisted Self Interviewing (CASI) Methods 


Couper and Nichols (1998) differentiate between 
computer self administered questionnaire (CSAQ) collec- 
tion, in which an interviewer is not present, and computer 
assisted self interviewing (CASI), in which an interviewer 
is present or delivers the survey instrument. Thus both 
e-mail and Internet surveys are based on CSAQ with the 
assistance of telecommunications technology. Other CSAQ 
methods are touchtone data entry (TDE), whereby 
respondents enter data using their touchtone telephones, 
and interactive voice recognition (IVR) or voice recognition 
entry (VRE). Both are based on respondents initiating calls 
to report at their convenience, after initial contact has been 
established, and have been extensively tested and 
successfully used by the US Bureau of Labor Statistics for 
data collection from establishments for its Current 
Employment Statistics program — Werking, Tupek and 
Clayton (1988), Winter and Clayton (1990) and Clayton 
and Winter (1992). Phipps and Tupek (1991) report on a 
study of the quality of TDE collection, by means of a record 
check. Their results show that there are few problems with 
the method and that response errors diminish with expe- 
rience. More recently US statistical agencies have initiated 
tests of the possibility of applying these CSAQ methods to 
household surveys. McKay, Robison and Malik (1994) 
report on initial laboratory testing of TDE for the Current 
Population Survey. Malakhoff and Appel (1997) report on 
the development of an IVR prototype at the US Bureau of 
Census, albeit for a listing operation by field staff. It should 
be noted that while TDE is obviously unique to telephone 
surveys, IVR could be used for other modes of collection. 

Computer assisted self interviewing (CASI) methods 
include audio (ACASI) and video (VCASI) modes of 
collection and have long been regarded as the natural 
extensions of mail surveys that benefit from modern day 
technology (Dillman 2000). Their usefulness has been 
especially emphasized for surveys of sensitive and 
embarrassing topics, where the presence of the interviewer 
during the interview may make respondents reluctant to 
answer in a face-to-face interview. For a review of recent 
advances in these methods see Baker (1998), O’Reilly, 
Hubbard, Lessler, Biemer and Turner (1994), Rogers, 
Miller, Forsyth, Smith and Turner (1996) and Tourangeau 
and Smith (1998). Practically all the reported applications 
are of surveys in which the survey instrument is brought to 
the respondent’s home by field staff. The use of the 
telephone for ACASI (T-ACASI) collection has already 
been tried — Turner, Forsyth, O’Reilly, Cooley, Smith, 
Rogers and Miller (1998). The long-expected development 
of videotelephony to become a widespread common form 
of telephone service for households has not yet materi- 
alized. If and when it occurs it should make telephone 
VCASI (T-VCASI) possible in the future, with important 
implications for telesurvey work. The addition of a visual 
element will help to overcome many of the problems of 
present day telephone surveys that are not present in 
face-to-face interviews (eye contact with the interviewer, 
use of cue cards and other visual aids). The use of video- 
telephony will probably not be universal for a very long 
time, so that at least for the time being, T-VCASI will only 
be able to serve as a supplementary mode of collection. 


4.3 Mobile Telephones 


The problems envisaged for coverage of fixed line RDD 
surveys due to the rapid proliferation of mobile telephones 
have been mentioned in section 3.3.1. In the future it is 
obvious that mobile telephones will have to be used to 
reach the ever-increasing numbers of households without 
fixed telephone lines. Present levels of mobile telephone 
coverage imply that mobile telephone surveys can, in 
general, only be used for specific populations or for 
supplementing fixed line RDD surveys. For instance 
Perone, Matrundola and Soverini (1999) report on a mobile 
telephone survey for a naturally accessible population, that 
of mobile telephone subscribers, in order to assess customer 
satisfaction. Refusal rates were found not to exceed those 
found in fixed line telephone surveys. However, non- 
contact rates were high, primarily due to subscribers being 
outside the signal range or shutting down their telephones. 
An additional problem associated with mobile phone 
surveys is that in many cases in North America the 
subscriber has to pay for received calls — Casady and 
Lepkowski (1999). 

As mentioned above, Cunningham et al. (1997) report 
on the use of mobile telephones to interview nontelephone 
households (primarily in rural areas), with the mobile 
telephone brought to the respondent by field interviewers. 
This was designed to minimize mode effects by having 
telephone interviews conducted by the same interviewers as 
those conducted for telephone households. The response 
rates were high, even though in some cases the interviews 
had to be conducted outdoors in order to obtain reasonable 
reception. The most intensive use of mobile phones for 
household surveys is no doubt for the Finnish Labour Force 
Survey — Kuusela and Notkola (1999). Out of some 97% of 
interviews completed by telephone, over 20% are carried 
out by mobile telephone. Although the average duration of 
mobile telephone interviews is somewhat longer than that 
of conventional telephone interviews, this is probably due 
to socio-demographic differences between the respondent 
groups. 


4.4 Future Technological Developments and their 
Effect on Telesurvey Methodology 


The rapid advances in technological developments in the 
areas of telecommunications and information systems make 
it very difficult to forecast their influence on survey work. 
Not all these technological changes will necessarily in- 
crease the potential for using advanced telecommunications 
technology for survey work. The problems raised by 
persons who have opted to ‘drop-out’ from the Internet 
(Katz and Aspden 1998) or from fixed line telephone 
service (see e.g., Gabler and Haeder 2000; and Kuusela and 
Vikki 1999) have already been mentioned. Furthermore, in 
some areas, such as market research and official statistics, 
technological developments may lead to a reduced reliance 
on surveys to gather information for decision-making. Thus 
Baker (1998) and Poynter (2000) predict that techniques 
such as data mining of existing data resources may become 
predominant for market research. Similarly, Scheuren and 
Petska (1993) discuss the possibilities for the use of 
administrative record systems for official statistics. How- 
ever, there still remain important areas (for instance for 
opinions and unobservable behaviour) in which surveys will 
remain the predominant source of data. The technological 
advances will open new possibilities for telesurvey work, 
though the required methodology might become more 
complex than that used today. 

One of the expected developments forecast for the near 
future is the integration of multiple communication devices 
and methods — telephony (fixed line and wireless), fax, 
internet, e-mail, videotelephony, data transmission, tele- 
vision transmissions etc. — Baker (1998). This implies that 
each individual will have access to a variety of tele- 
communication services possibly via the same physical 
instrument, which could be a mobile phone (e.g., via WAP 
technology), a PC or a TV set or some combination of 
these. Similarly, the survey taker may be able to gain access 
to respondents via several different modes. See Ranta-aho 
and Leppinen (1997) for some of the issues involved in this 
plethora of possible avenues of access. It is envisaged that 
the recipient will have a large degree of control over 
whether to receive communications at all and, if so, by 
which mode. This is already now ensured for many users by 
means of sophisticated devices for screening, forwarding, 
message transfer, multiple message transmission etc. On the 
other hand, the degree of control of mode of transmission 
by the sender will probably decrease as a result. 

The implications of these developments for survey work 
are that mixed mode surveys and possibly multi-frame 
methodology will have to become predominant. Although 
we consider that overall telecommunications coverage will 
increase to some saturation point that is close to universal 
coverage, it seems unlikely that any given mode of 
telecommunication will by itself provide virtually complete 
coverage. Furthermore, even when a single mode may 
provide practically complete coverage, it is not clear that a 
mixed mode approach, taking into account respondents’ 
mode preferences, is not preferable. The increased reliance 
of survey work on the voluntary cooperation of respondents 
practically dictates that we should offer the respondent the 
choice of mode. However it should be pointed out that 
mixed mode surveys are very expensive and that the present 
technology does not allow the simple transfer of question- 
naires developed for one mode (e.g., the CAI Blaise 
questionnaire) to another mode — e.g., to a paper form. 

The major problem that the new developments in tele- 
communications pose for survey design will probably be the 
choice of relevant frames and the allocation of sample 
units to modes of collection. Eventually it is envisaged that 
each individual will have a unique, permanent, personal 
communication number (or ID) through which he/she can 
be reached by a multiplicity of modes (written, oral or 
visual), via a variety of fixed line or wireless devices which 
could be at home, in the office or mobile. The choice of 
mode will be ultimately controlled by the joint decision of 
recipient and sender. While the idea of such a universal 
number (which would basically be an identity number) is no 
doubt anathema to libertarians, there is little doubt that it 
will eventually become acceptable, even if small activist 
groups may attempt to evade its use and even disrupt its 
proliferation. In fact standard universal identity number 
systems have been operating, and have been well accepted, for 
several decades in many countries in Northern Europe and 
in Israel. The identity number in these countries is not 
regarded as confidential information and is widely used for 
many administrative and commercial purposes. For 
example, in Israel personal cheques are required by law to 
include the person’s ID number, name, address and 
telephone number. 

Once such a system of unique communication numbers 
is operable, standard methods of sampling can be used. It 
may well be that complete lists of these numbers will be 
generally available — possibly with only limited geo- 
graphical or other information. This is the situation with 
respect to IDs in many national registration systems. There 
are reasons to expect that a similar situation may prevail for 
communication numbers — initially at least in Europe rather 
than in North America. This could come about since the 
need for unlisted status might well be made redundant 
because of sophisticated screening techniques. Although 
screening may enhance the ease of non-response, the 
possibility of transmitting prior written messages by e-mail 
or voice mail could reduce the problem. 

Sampling from such lists would be simple but in most 
cases might be inefficient, since it could benefit only 
marginally from auxiliary information. While differenti- 
ation between personal and business contacts might be 
ensured by the listings, it is doubtful that any household 
information would be available. This dictates that the 
sampling and reporting unit would be the individual rather 
than a household. This is in any case the aim of many 
surveys and the usefulness of the household as a sampling 
unit for telesurveys is definitely doubtful, even under 
current practice. Household information, if required, would 
have to be obtained from the individual and include infor- 
mation on household size to ensure proper weighting for 
household characteristics. If the communications number- 
ing system ensures the allocation of a single number to each 
individual, no information is required on the modes of 
communication or their multiplicity. 
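
The household-size adjustment described above can be sketched as follows. This is only a minimal illustration, assuming an equal-probability sample of individuals from a list with exactly one communication number per person; the class `Respondent`, its fields and the example figures are hypothetical and are not taken from any survey discussed here. Since a household with m listed members can enter the sample through any one of them, the person weight is divided by the reported household size before household characteristics are estimated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Respondent:
    person_weight: float   # inverse of the person's selection probability
    household_size: int    # members holding a number, as reported by the respondent
    has_feature: bool      # hypothetical household-level characteristic


def household_weight(r: Respondent) -> float:
    """Weight for household-level estimates: person weight / household size."""
    if r.household_size < 1:
        raise ValueError("household size must be at least 1")
    return r.person_weight / r.household_size


def household_proportion(sample: List[Respondent]) -> float:
    """Weighted proportion of households that have the characteristic."""
    total = sum(household_weight(r) for r in sample)
    with_feature = sum(household_weight(r) for r in sample if r.has_feature)
    return with_feature / total


# Hypothetical sample of three respondents with equal person weights.
sample = [
    Respondent(100.0, 1, True),
    Respondent(100.0, 4, False),
    Respondent(100.0, 2, True),
]
estimate = household_proportion(sample)
```

In this made-up sample the unadjusted share of respondents with the characteristic is 2/3, while the household-level estimate is 150/175, because the four-person household is down-weighted.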

If listings of communication numbers are not available 
or if the problem of unlisted numbers does persist, some 
form of RDD will have to be used. This should not differ 
much from the RDD techniques currently employed. 
Assuming that the communication numbering system is 
indeed unique and universal and also arranged by some 
logic, efficient methods for sampling could easily be 
developed. Hopefully the numbering system will still bear 
some relationship to geography, via the individual’s 
permanent address. Otherwise local or even national RDD 
surveys will become extremely difficult to design effi- 
ciently. If sufficient information on the numbering system 
is available, the extent of out-of-scope numbers could be 
minimized. 
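
As a rough sketch of the simplest form of RDD under such a numbering system, the Python fragment below draws distinct numbers at random within a set of in-scope prefixes. The prefixes, the suffix length and the function name `rdd_sample` are all assumptions made for illustration, since the structure of any future communication numbering system is unknown. The draw is equal-probability only if the prefixes are distinct and of equal length.

```python
import random
from typing import List, Optional


def rdd_sample(prefixes: List[str], suffix_digits: int, n: int,
               seed: Optional[int] = None) -> List[str]:
    """Draw n distinct numbers uniformly from prefix + random suffix."""
    capacity = len(prefixes) * 10 ** suffix_digits
    if n > capacity:
        raise ValueError("sample size exceeds the number of possible numbers")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        prefix = rng.choice(prefixes)              # in-scope bank of numbers
        suffix = rng.randrange(10 ** suffix_digits)  # random final digits
        chosen.add(f"{prefix}{suffix:0{suffix_digits}d}")
    return sorted(chosen)


# Hypothetical prefixes standing in for geographically defined number banks.
numbers = rdd_sample(["0312", "0313"], suffix_digits=4, n=50, seed=1)
```

Restricting the draw to prefixes known to be in use is what keeps the share of out-of-scope numbers low; without information on the numbering logic, every prefix would have to be treated as eligible.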

Since it is likely that choice of the mode of communi- 
cations will be largely under the control of the recipient, the 
question of allocation of sample units to mode of communi- 
cation will probably hardly arise. The survey taker will have 
to prepare a whole range of collection instruments suitable 
for the different modes of communication. These would 
have to include written instruments, such as faxed, e-mail 
and Internet versions of questionnaires, oral instruments, 
such as traditional voice interviews and automated inter- 
viewing, and combinations of these. The integration of the 
data obtained from these modes of collection into a uniform 
data set would be a formidable but surmountable technolo- 
gical challenge. 

The almost utopian situation described above will 
probably take a long time to reach and in the interim 
suitable methodologies will have to be developed to deal 
with the problems arising from the short-term developments 
in communications technology and their application. The 
necessity to move from telephone surveys based solely 
on fixed-line telephones to some combination of mobile and 
fixed-line telephones will have to be dealt with very 
shortly, as pointed out in section 4.3. Basically multiple 
frame methodology developed to cover both telephone 
households and non-telephone households can easily be 
extended to deal with this. The development of suitable 
frames and/or RDD sampling methods for mobile tele- 
phones still has to be carried out, but the necessary 
principles are available. The problem of combining data 
obtained from mobile phones which are basically personal 
devices with that obtained from fixed-line telephones, 
which are still fundamentally household devices, will have 
to be worked out to ensure proper weighting. To ensure 
this, sufficiently complete information on all the commu- 
nication devices available to the household is required. 
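
One simple way of using such information is a multiplicity adjustment, sketched below. It is offered as one possible approach among several, not as the method of any study cited here. It assumes that fixed-line and mobile numbers are sampled independently with known sampling fractions, and that a household contributes to the estimate once for every one of its numbers that is selected; under these assumptions, dividing by the expected number of selections gives an unbiased household total. The function name and the sampling fractions are hypothetical.

```python
def dual_frame_household_weight(n_fixed_lines: int, n_mobiles: int,
                                f_fixed: float, f_mobile: float) -> float:
    """Inverse of the expected number of times a household is selected.

    n_fixed_lines, n_mobiles: devices through which the household can be
    reached, as reported in the interview.
    f_fixed, f_mobile: sampling fractions applied to each frame.
    """
    expected_hits = n_fixed_lines * f_fixed + n_mobiles * f_mobile
    if expected_hits <= 0:
        raise ValueError("household cannot be reached through either frame")
    return 1.0 / expected_hits


# A household with one fixed line and two mobile phones.
w_both = dual_frame_household_weight(1, 2, f_fixed=0.001, f_mobile=0.0005)
# A household with a single mobile phone and no fixed line.
w_mobile_only = dual_frame_household_weight(0, 1, f_fixed=0.001, f_mobile=0.0005)
```

The mobile-only household receives the larger weight, reflecting its smaller chance of selection; this is why complete reporting of all devices available to the household matters for the weighting.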

In conclusion, the advances in telesurvey methodology 
over the past few decades, which have made telephone 
surveys a viable and predominant survey instrument, will 
have to be continually updated to deal with the ever- 
changing developments in telecommunications technology 
and its usage. However the basic elements for these new 
developments are available and will continue to allow the 
use of advanced options to obtain high quality survey data. 


Regression Composite Estimation for the Canadian Labour Force Survey 
with a Rotating Panel Design 


AVINASH C. SINGH, BRIAN KENNEDY and SHIYING WU 


ABSTRACT 


We consider the regression composite estimation introduced by Singh (1994, 1996; termed earlier as “modified regression 
composite” estimation), a version of which (suggested by Fuller 1999) has been implemented for the Canadian Labour Force 
Survey (CLFS) beginning in January 2000. The regression composite (rc) estimator enhances the generalized regression 
(gr) estimator used earlier for the CLFS and the well known Gurney-Daly ak-composite estimator in several ways. The main 
features of the rc-estimator are: (a) it considerably improves the efficiency of level and change estimates for key study 
variables, resulting in a less volatile estimate series; (b) it is calculated like the gr-estimator as a calibration estimator such 
that all the usual poststratification controls used in gr as well as the new controls corresponding to correlated variables from 
the previous time point are met; and (c) it respects the internal consistency of estimators without having to calculate part 
estimates differently as residuals. The main innovations used in the rc-class of estimators entail: (a) using the idea of a working 
covariance matrix in estimating functions as an alternative to superpopulation modeling for defining regression coefficients 
for the predictors in the gr-estimator; (b) treating random controls (the ones based on the key correlated variables from the past) 
as fixed while computing the regression coefficients, similar to two-phase estimation and motivated by the working 
covariance idea; and (c) using micro-matching to obtain the previous time point’s micro-level auxiliary information 
for realizing higher correlation with the present time point’s study variables. As a by-product, a new version of the 
ak-estimator, which uses the micro-matching based predictors from the past rather than the traditional macro-level ones, is recommended 
in the interest of higher efficiency gains. The paper also presents an interesting heuristic justification of the smoothness 
feature of composite estimates using the amortization idea. Empirical results based on the Ontario 1996 CLFS data are 
presented for comparison of various estimators. 


KEY WORDS: Generalized regression; Modified regression; Estimating functions; Regression calibration. 


1. INTRODUCTION 


In the case of repeated surveys with partially overlapping 
samples, it is well known (see, e.g., Cochran 1977, Ch. 12) 
that estimates of level at a point in time and change between 
two time points can be improved by regressing the usual 
cross-sectional estimator (typically regression or simply 
Horvitz-Thompson) on the new predictors provided by the 
correlated observations on the overlapping subsample from 
the previous time point. Such methods of estimation belong 
to the class of composite estimation, a simple version 
of which, known as the k-composite estimator, was proposed 
some time ago by Hansen, Hurwitz and Madow (1953) and 
examined further by Rao and Graham (1964). Binder and 
Hidiroglou (1988) provide an excellent review of the 
literature on estimation with repeated surveys. Note that 
there is an associated loss of efficiency in estimates 
aggregated over several time points due to increased 
positive correlation between composite estimates of 
successive time points. This is, however, probably a small 
price to pay because it is not the aggregate, but the level and 
change estimates that need more precision. The 
ak-composite estimator of Gurney and Daly (1965) provides an 
improved version of the k-composite estimator by reducing 
the variance further, an alternative simpler justification of 
which was provided by Wolter (1979). 
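
To make the idea concrete, the k-composite estimator in its usual textbook form combines the present cross-sectional estimate with a prediction built from the previous composite estimate plus the change estimated on the overlapping subsample. The sketch below implements that recursion; the function name, the argument names and the example figures are hypothetical, and the choice k = 0.5 is arbitrary rather than optimal.

```python
from typing import List


def k_composite(direct: List[float], change: List[float], k: float) -> List[float]:
    """k-composite series.

    direct[t]: cross-sectional estimate of level at time t.
    change[t]: estimate of change from t-1 to t based on the overlapping
               subsample (change[0] is not used).
    k:         weight given to the prediction from the previous time point.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    composite = [direct[0]]  # no previous information at the first time point
    for t in range(1, len(direct)):
        prediction = composite[t - 1] + change[t]
        composite.append((1.0 - k) * direct[t] + k * prediction)
    return composite


# Hypothetical three-month series.
direct = [100.0, 104.0, 101.0]
change = [0.0, 2.0, -1.0]
series = k_composite(direct, change, k=0.5)
```

With these made-up figures the composite series moves from 100 to 103 to 101.5, whereas the direct series moves from 100 to 104 to 101, which illustrates the smoothing that composite estimation is intended to provide. Setting k = 0 returns the direct series unchanged.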


The composite estimator considered in this paper was 
developed in the context of the Canadian Labour Force 
Survey (CLFS). The CLFS is a monthly survey that follows 
a rotating panel design with six panels. In any two 
consecutive months, five sixths of the households form the 
overlapping sample. It was in January 2000 that the CLFS 
started using a version (suggested by Fuller 1999) of the 
composite estimators introduced by Singh (1994, 1996) 
termed originally as “modified regression composite” 
estimators, which will be referred to in this paper as simply 
“regression composite” or rc-estimators. Before January 
2000, CLFS used the generalized regression (gr) estimators 
of Cassel, Särndal, and Wretman (1976) and Särndal (1980) 
which were based on only cross-sectional (i.e., present 
month’s) data. It has long been felt that the estimator for 
CLFS could be improved using the composite estimation 
idea in the sense that estimates of level and change would 
be more efficient, and hence the resulting series would be 
more stable, i.e., less volatile. There are four goals that the 
rc-estimator attempts to meet in modifying the gr-estimator: 
(i) It should considerably increase the efficiency of level 
and change estimates so that the estimate series 
becomes smoother or less volatile. 

(ii) It can be computed as a calibration estimator like the 
gr-estimator so that the existing estimation software 
system can be used with little modification, 

(iii) The final calibrated weights should continue to 
satisfy the usual demographic and geographic 
controls used in the gr-estimator in addition to some 
new controls based on past month’s variables, and 

(iv) The estimator should have the internal consistency 
property in that the part composite estimators add up 
to the whole, e.g., estimates for Employed (E), 
Unemployed (U), and Not in the Labour Force (N) 
should add up to the total eligible population in the 
domain of interest. 
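
Goal (iv) can be illustrated with a small numerical sketch. It uses a one-step k-composite update of the usual textbook form, applied separately to E, U and N; all figures and the coefficient values are hypothetical. When each characteristic has its own coefficient the three composite estimates need not add up to the population total, whereas a common coefficient preserves the total.

```python
from typing import Dict


def composite_step(direct: float, previous: float, change: float, k: float) -> float:
    """One k-composite update for a single characteristic."""
    return (1.0 - k) * direct + k * (previous + change)


def composite_parts(k_by_char: Dict[str, float]) -> Dict[str, float]:
    """Apply the update to E, U and N with the given coefficients."""
    direct = {"E": 600.0, "U": 50.0, "N": 350.0}    # present month, sums to 1000
    previous = {"E": 590.0, "U": 55.0, "N": 355.0}  # past composite, sums to 1000
    change = {"E": 4.0, "U": -2.0, "N": -2.0}       # matched-sample change, sums to 0
    return {c: composite_step(direct[c], previous[c], change[c], k_by_char[c])
            for c in direct}


# Characteristic-specific coefficients: the parts no longer add to 1000.
separate = composite_parts({"E": 0.6, "U": 0.3, "N": 0.5})
total_separate = sum(separate.values())

# A common coefficient keeps the parts consistent with the total.
common = composite_parts({"E": 0.5, "U": 0.5, "N": 0.5})
total_common = sum(common.values())

# The residual device: derive N from the total instead of estimating it.
n_as_residual = 1000.0 - separate["E"] - separate["U"]
```

Here the separately composited parts sum to about 998.8 rather than 1000, so one component would have to be obtained as a residual to restore consistency.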


The ak-estimator was studied by Kumar and Lee (1983) 
in the context of the CLFS, and it was found that it did not give 
the substantial gains in efficiency required by goal (i). 
Goal (iv) was, of course, known not to be satisfied by the 
ak-estimator because the (optimal) coefficients a and k 
used for combining several present month’s estimators (in 
fact three of them: one is the usual estimator based on the 
present month, and the other two are built on predictors 
from the past month) turn out to be specific to the 
characteristic, such as E. A solution (although a rather 
undesirable one) is to designate one of the components as 
least important (say, N) and then obtain its estimate as a 
residual. Goals (ii) and (iii) can, however, be met by the 
ak-composite weighting suggested by Fuller (1990), and 
studied in the US Current Population Survey context by 
Lent, Miller, and Cantwell (1994, 1996). Goal (ii) is 
especially important for unplanned study variables for which 
the coefficients (a, k) are not known in advance. The 
rc-estimator meets all four goals, in particular goals (i) and 
(iv), by making use of the following three innovations: 

(i) The design-based estimation in the presence of 
correlated predictors can be cast in an estimating 
functions framework as defined by Godambe and 
Thompson (1989), and the idea of a working 
covariance matrix as in Liang and Zeger (1986) can 
then be used to obtain an alternative to superpopulation 
modelling for computing regression coefficients. The 
resulting regression estimates, like gr, are only 
suboptimal under the design randomization. 

(ii) The previous month’s full sample composite 
estimates used as regression controls for present 
month’s estimation can be treated as fixed using the 
working covariance idea for computational simplicity 
without violating the design consistency property. For 
variance estimation, the extra variation due to random 
controls should, of course, be accounted for. 

(iii) Using micro-matching of the present month’s over- 
lapping subsample with the previous month, infor- 
mation about key study variables from the previous 
month is augmented to the present month’s data. 
These now serve as additional covariates deemed to 
be highly correlated with the present month’s study 
variable. 


These innovations allow for computation of all estimates 
using the gr-system, thus avoiding the need of having to 
compute parts of estimates as residuals in the interest of 
internal consistency. The feature of micro-matching gives 
rise to desired gains in efficiency. In practice, it would often 
be the case that some of the present month’s respondents in 
the overlapping sample were nonrespondents in the 
previous month, and so imputation might be necessary. In 
the case of CLFS, this is a small fraction, and the Hot Deck 
method with donor classes defined by demographic, geo- 
graphic (subprovincial economic regions), type of area 
(rural/urban), present month’s employment status, and 
industry group is used to fill in the missing values. It may 
be noted that sometimes imputation may be necessary not 
due to nonresponse at the previous time point, but due to 
the household’s move. Assuming that, on average, 
households that move into the dwellings sampled at the 
present time t are similar to the households that move out at 
t, then even though movers may have different employment 
characteristics than nonmovers, the imputation for movers 
is not expected to introduce any new bias, as the current 
month’s employment status, among other covariates, is taken 
into account. 
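
The hot deck step described above can be illustrated with a minimal Python sketch. It is only an assumption-laden illustration, not the CLFS production procedure: the record layout, the field names and the helper `hot_deck` are hypothetical, and a donor is simply drawn at random from respondents in the same donor class.

```python
import random


def hot_deck(records, class_keys, target, rng=None):
    """Fill missing `target` values from a random donor in the same class.

    `records` is a list of dicts; a missing value is represented by None.
    Records whose class has no donor are left unchanged.
    """
    rng = rng or random.Random(0)
    donors = {}
    for r in records:
        if r[target] is not None:
            key = tuple(r[k] for k in class_keys)
            donors.setdefault(key, []).append(r[target])
    completed = []
    for r in records:
        r = dict(r)  # work on a copy so the input is not modified
        if r[target] is None:
            pool = donors.get(tuple(r[k] for k in class_keys))
            if pool:
                r[target] = rng.choice(pool)
        completed.append(r)
    return completed


# Hypothetical records: donor class = (region, present month's status).
sample = [
    {"region": "A", "status_now": "E", "status_prev": "E"},
    {"region": "A", "status_now": "E", "status_prev": None},
    {"region": "B", "status_now": "U", "status_prev": "N"},
    {"region": "B", "status_now": "U", "status_prev": None},
]
imputed = hot_deck(sample, ["region", "status_now"], "status_prev")
```

Because the present month’s status is part of the donor class, an imputed previous month’s value always comes from a respondent with the same current status, which is the feature the argument about movers relies on.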

In the concluding section 6, a method is suggested to 
diagnose the impact of this imputation. This impact may be 
serious for surveys with a high fraction of previous month’s 
missing values for the present month’s respondents in the 
overlapping subsample. A possibly simple way out would 
be to redesign the questionnaire so that, while administering 
the interview in the second or later months, the interviewer 
is prompted by the CATI instrument (computer assisted 
telephone interviewing software, commonly used nowadays) 
if the respondent was a nonrespondent in the previous 
month. If so, then the interviewer administers a 
rather short supplementary questionnaire in order to elicit 
the respondent’s employment status for the previous month. 
This idea is similar to the method suggested by Hansen- 
Hurwitz-Madow for completely nonoverlapping repeated 
surveys, but each respondent is asked questions for the 
present as well as the previous time point, see Cochran 
(1977, page 355). 

The organization of this paper is as follows. Section 2 
presents a heuristic motivation using the amortization idea 
of why composite estimation, in general, is expected to 
provide desired smoothing of the estimate series. Section 
3 defines various estimators, and discusses their compu- 
tation via the gr-system. A new version of the ak-estimator, 
denoted by ak*, is also proposed. The estimator uses 
predictors from previous month based on micro-matching, 
and is expected to give high gains in efficiency. Section 4 
considers variance estimation by the currently used method 
of jackknife. An empirical comparison of the estimators is 
presented in section 5 using the Ontario 1996 CLFS data. 
Finally section 6 contains concluding remarks. 


Survey Methodology, June 2001 


2. SERIES SMOOTHING BY COMPOSITE 
ESTIMATION: HEURISTICS 


In this section, we present an interesting heuristic 
justification (based on the amortization idea rather than the 
shrinkage) of why smoothing of the estimate series is 
expected by composite estimation. (Using only the 
shrinkage idea, the series can be smoothed but it may not 
cross the original series often enough. With amortization, 
however, the left-over part after shrinkage is accounted for 
gradually over time, thus allowing for the smoothed series 
to cross the original one more often.) Consider a panel 
rotation scheme similar to that of the CLFS and let $\gamma$ denote 
the fraction of the panels rotated out; in the case of the CLFS, 
$\gamma$ is 1/6. Denote the cross-sectional estimator (typically gr) 
at time $t$ based on all panels, i.e., the full sample, by $F_t$, the 
estimator based on only the birth (i.e., rotate-in) panel by 
$B'_t$, and the one based on nonbirth panels (i.e., the 
subsample at $t$ overlapping with the past sample at $t-1$) by 
$B_t$. Similarly, denote the estimator based only on the death 
(i.e., rotate-out) panel by $D'_{t-1}$, and the one based on 
nondeath panels (i.e., the subsample at $t-1$ overlapping 
with the present sample at $t$) by $D_{t-1}$. We have 

$$F_t = \gamma B'_t + (1-\gamma) B_t, \tag{2.1a}$$

$$F_{t-1} = \gamma D'_{t-1} + (1-\gamma) D_{t-1}. \tag{2.1b}$$

Suppose the series $\{F_t\}$ is too volatile, and we wish to 
smooth it. In the following it is assumed that there is no 
rotation group bias (Bailar 1975), i.e., different rotation 
groups have the same expected value. Thus $F_t$ is unbiased 
but may be unstable. This set-up is the traditional one for 
composite estimation in which different unbiased estimates 
are combined optimally to get a more efficient estimate. 
However, see the discussion at the end of this section for an 
alternative perspective on composite estimation in the 
presence of rotation group bias. Now denote the smoothed 
series by $\{C_t\}$, and consider the identity: 

$$F_t = C_{t-1} + (F_t - F_{t-1}) + (F_{t-1} - C_{t-1}). \tag{2.2}$$


The above relation can be interpreted as follows. The 
estimate $C_{t-1}$ at $t-1$ is adjusted by the fluctuation 
$(F_t - F_{t-1})$ at the next time point $t$ in the $F$-series, and the 
existing gap $(F_{t-1} - C_{t-1})$ at the time point $t-1$. If we 
define $C_t$ after full adjustments for these two differences, 
then $C_t$ would be the same as $F_t$ and there would be no 
smoothing of the $F$-series. This suggests that the 
adjustments for the differences $(F_t - F_{t-1})$ and $(F_{t-1} - C_{t-1})$ 
should be accounted for only partially as the $C$-series moves 
from $C_{t-1}$ to $C_t$. The remaining portions of the differences 
should be amortized gradually over future time points. All 
these adjustments should be done without affecting 
unbiasedness of the estimator $C_t$. The difference 
$(F_{t-1} - C_{t-1})$ is zero in expectation assuming unbiasedness 
of $C_{t-1}$ and $F_{t-1}$ (which is so under the assumption of no 
rotation group bias) and therefore amortizing parts of it 
would not affect unbiasedness of future estimates $C_t$. 
However, the difference $F_t - F_{t-1}$ is not zero in 
expectation, and care should be exercised in amortizing part of 
this difference. Observe that 

$$F_t - F_{t-1} = (B_t - D_{t-1}) + \gamma(B'_t - B_t) - \gamma(D'_{t-1} - D_{t-1}). \tag{2.3}$$

The first term on the RHS is the change estimate based 
on common panels, while the second and third terms 
represent birth and death effects at $t$ and $t-1$ respectively. 
The last two terms are zero functions (i.e., are zero in 
expectation) but the first one is not. (Fortunately, the first 
term is expected to be stable as it is a difference of two 
highly correlated estimates.) Therefore, it is the second and 
third terms that should be amortized. Now, write (2.2) as 


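$$\begin{aligned}
F_t &= C_{t-1} + (B_t - D_{t-1}) + \gamma(B'_t - B_t) - \gamma(D'_{t-1} - D_{t-1}) + (F_{t-1} - C_{t-1}) \\
    &= C_{t-1} + (B_t - D_{t-1}) + \gamma(B'_t - B_t) + \bigl(F_{t-1} - \gamma(D'_{t-1} - D_{t-1}) - C_{t-1}\bigr) \\
    &= C_{t-1} + (B_t - D_{t-1}) + \gamma(B'_t - B_t) + (D_{t-1} - C_{t-1}),
\end{aligned} \tag{2.4}$$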


and define two amortization factors $\delta_{1t}$, $\delta_{2t}$ between 0 and 
1, and then define the smoothed series $\{C_t\}$ as 

$$C_t = C_{t-1} + (B_t - D_{t-1}) + \delta_{1t}\,\gamma(B'_t - B_t) + \delta_{2t}\,(D_{t-1} - C_{t-1}). \tag{2.5}$$

The term with $\delta_{1t}$ in (2.5) represents shrinkage of the 
birth effect at $t$ which $C_t$ tries to account for, while the term 
with $\delta_{2t}$ refers approximately to shrinkage of the death 
effect at the past time $(t-1)$ which $C_t$ tries to make up for 
at the present time $t$. Also, it would be desirable to set 
$\delta_{2t} < \delta_{1t}$ in order for the series $\{C_t\}$ to track $\{F_t\}$ better 
so that they have similar trend over time, i.e., give more 
importance to the current birth effect than the past death 
effect. (In fact, a rigorous justification under fairly general 
conditions of why one should set $\delta_{2t} < \delta_{1t}$ comes from 
optimality considerations in which the variance of $C_t$ is 
minimized to obtain the best linear combination of three 
unbiased estimators, $F_t$, $C_{t-1} + B_t - D_{t-1}$, and 
$F_t + C_{t-1} - D_{t-1}$, of the present month’s population total; 
see (2.8) at the end of this section for the actual 
expression.) Now, to see the connection with the well 
known composite estimates defined in the next section, 
define $0 < a_t, b_t < 1$ so that $\delta_{1t} = 1 - b_t$, $\delta_{2t} = 1 - b_t - a_t$. 
We have 

$$C_t = C_{t-1} + (B_t - D_{t-1}) + (1 - b_t)\,\gamma(B'_t - B_t) + (1 - b_t - a_t)(D_{t-1} - C_{t-1}). \tag{2.6}$$

It is interesting to note that if $b_t = 0$, there would be no 
dampening of the birth effect, and the $C$-series is expected 
to be closer to the $F$-series, i.e., there is less smoothing and 
the two would cross each other more often. If $a_t = 0$, the past 
effect represented by $(D_{t-1} - C_{t-1})$ is dampened less. This 
would imply more smoothing of the $F$-series, and the two 
series are expected to cross each other less frequently. 
Finally, if $a_t, b_t > 0$, then the behaviour of the $C$-series 
relative to the $F$-series would be somewhere in the middle. 
Moreover, if $b_t$ is high (close to 1), there would be quite a 
bit of smoothing of the $F$-series because there is high 
amortization of both the birth and death effects. In these 
situations, one would expect sustained gaps between the $F$ 
and $C$ series over time before they cross each other. Notice 
that parts of the term $\gamma(B'_t - B_t)$ that get amortized over 
$t, t+1, \ldots$ decrease as $t$ increases. They are given by 
$b_t\,\gamma(B'_t - B_t)$, $(b_{t+1} + a_{t+1})\, b_t\,\gamma(B'_t - B_t), \ldots$. Similarly, 
the amortized parts of $(D_{t-1} - C_{t-1})$ are 
$(b_t + a_t)(D_{t-1} - C_{t-1})$, $(b_{t+1} + a_{t+1})(b_t + a_t)(D_{t-1} - C_{t-1}), \ldots$. 

Clearly, when $b_t$ is large, it will take several time points 
for completing the amortization. However, as explained 
earlier, this would not introduce bias because the effects 
being amortized are zero functions under the assumption of 
no rotation group bias. 
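
The amortization mechanism can be traced numerically. The following is a minimal Python sketch, assuming constant factors $a_t = a$ and $b_t = b$ and made-up panel estimates; the function names are illustrative and not part of the paper. It shows that full adjustment ($a = b = 0$) reproduces $F_t$, and that with partial adjustment the gap $F_t - C_t$ left for future amortization is $b\,\gamma(B'_t - B_t) + (b + a)(D_{t-1} - C_{t-1})$.

```python
def full_sample(gamma, birth, nonbirth):
    """F_t = gamma * B'_t + (1 - gamma) * B_t, as in (2.1a)."""
    return gamma * birth + (1.0 - gamma) * nonbirth


def composite_step(c_prev, nonbirth, birth, d_prev, gamma, a, b):
    """One step of the recursion (2.6) with constant a_t = a, b_t = b."""
    return (c_prev + (nonbirth - d_prev)
            + (1.0 - b) * gamma * (birth - nonbirth)
            + (1.0 - b - a) * (d_prev - c_prev))


def leftover_gap(c_prev, nonbirth, birth, d_prev, gamma, a, b):
    """Part of the birth and past effects not yet absorbed at time t."""
    return b * gamma * (birth - nonbirth) + (b + a) * (d_prev - c_prev)


# Illustrative (made-up) values: B_t, B'_t, D_{t-1}, C_{t-1}.
gamma = 1.0 / 6.0
B, B_birth, D_prev, C_prev = 100.0, 106.0, 98.0, 97.0

F = full_sample(gamma, B_birth, B)
C_full = composite_step(C_prev, B, B_birth, D_prev, gamma, a=0.0, b=0.0)
C_part = composite_step(C_prev, B, B_birth, D_prev, gamma, a=0.2, b=0.5)
gap = leftover_gap(C_prev, B, B_birth, D_prev, gamma, a=0.2, b=0.5)
```

With these inputs, `C_full` equals `F`, while `F - C_part` equals `gap`, which is the quantity carried forward to later time points.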

The expression (2.6) can be cast into a more familiar 
expression of the composite estimator as follows: 

$$\begin{aligned}
C_t &= C_{t-1} + (B_t - D_{t-1}) + (1 - b_t)(F_t - B_t) + (1 - b_t - a_t)(D_{t-1} - C_{t-1}) && (2.7a) \\
    &= C_{t-1} + (B_t - D_{t-1}) + (1 - b_t)(F_t - B_t + D_{t-1} - C_{t-1}) + a_t(C_{t-1} - D_{t-1}) && (2.7b) \\
    &= F_t + b_t\bigl(C_{t-1} - (F_t + D_{t-1} - B_t)\bigr) + a_t(C_{t-1} - D_{t-1}) && (2.7c) \\
    &= F_t + (b_t + a_t)(C_{t-1} - D_{t-1} + B_t - F_t) + a_t(F_t - B_t). && (2.7d)
\end{aligned}$$

The expression (2.7d) coincides with the ak-estimator 
(see next section) when $a_t = a$ and $b_t + a_t = k$. In practice, 
the values of $a_t$ and $b_t$ can be determined optimally or 
suboptimally using regression (see next section). The partial 
regression coefficients $a_t$, $b_t$ satisfy $0 < a_t < b_t < 1$ in general, 
because the direct estimator $F_t$ is expected to be more 
positively correlated with the predictor $F_t + D_{t-1} - B_t$ 
$(= D_{t-1} + \gamma(B'_t - B_t))$ than with the predictor $D_{t-1}$, both 
predictors being unbiased estimates, like $C_{t-1}$, of the 
population total parameter at the previous time point $t-1$. 
It follows from (2.7c) that the estimator $C_t$ can be written 
as a linear combination of the three unbiased estimators 
mentioned earlier, and is given by 

$$C_t = (1 - b_t - a_t)\,F_t + b_t\,(C_{t-1} + B_t - D_{t-1}) + a_t\,(F_t + C_{t-1} - D_{t-1}). \tag{2.8}$$
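
As a quick numerical check that the forms (2.7a), (2.7d) and (2.8) describe the same estimator, the following Python sketch evaluates each of them for one set of made-up inputs. The function names and the numbers are illustrative assumptions only.

```python
def form_27a(F, B, D_prev, C_prev, a, b):
    """Composite estimate in the amortization form (2.7a)."""
    return (C_prev + (B - D_prev) + (1 - b) * (F - B)
            + (1 - b - a) * (D_prev - C_prev))


def form_27d(F, B, D_prev, C_prev, a, b):
    """Composite estimate in the ak-type form (2.7d), with k = b + a."""
    return F + (b + a) * (C_prev - D_prev + B - F) + a * (F - B)


def form_28(F, B, D_prev, C_prev, a, b):
    """Composite estimate as a combination of three estimators (2.8)."""
    return ((1 - b - a) * F + b * (C_prev + B - D_prev)
            + a * (F + C_prev - D_prev))


# Illustrative (made-up) values for F_t, B_t, D_{t-1}, C_{t-1}, a_t, b_t.
args = (101.0, 100.0, 98.0, 97.0, 0.2, 0.5)
values = [f(*args) for f in (form_27a, form_27d, form_28)]
```

The three entries of `values` agree up to rounding error. The weights in (2.8), namely $1 - b_t - a_t$, $b_t$ and $a_t$, sum to one, which is why unbiasedness is preserved when the three component estimators are unbiased.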


The above heuristic motivation corresponds to the variance 
reduction considerations under the assumption of no 
rotation group bias when combining three unbiased 
estimators of the population total at $t$. In the presence of 
rotation group bias, however, all the three estimators become 
biased with possibly different magnitude and direction, and 
what composite estimation does is to adjust each one of 
them so that the adjusted value for each is equal to a 
common value given by the composite estimator. (For 
example, in the case of two estimators $\hat\theta_1$ and $\hat\theta_2$ of $\theta$, the 
linear combination $\lambda\hat\theta_1 + (1-\lambda)\hat\theta_2$ can be written as 
$\hat\theta_2 + \lambda(\hat\theta_1 - \hat\theta_2)$ or $\hat\theta_1 + (1-\lambda)(\hat\theta_2 - \hat\theta_1)$, implying that the 
two original estimators are adjusted so as to converge 
to a common value.) The relative weight in combining 
the three estimators depends on the criterion of 
minimum variance. Ideally, it should be based on the 
minimum MSE criterion, but it is hard to get a handle on the bias 
because it cannot be estimated. Clearly the composite 
estimator is not bias free, and it can only be speculated that the 
overall bias of the estimator is reduced by compositing. 
Similarly if, instead, a suboptimal regression is used in 
constructing the composite estimator (as in rc-estimation, 
see the next section), then what composite estimation does 
is to adjust the sampling weights in the full sample (which 
are generally gr-weights) so that $F_t - (B_t - D_{t-1})$ and $D_{t-1}$ 
with adjusted weights become equal to $C_{t-1}$, with $C_{t-1}$ serving 
as new controls in the calibration step. This is another way 
of adjusting the three estimators to a common value, but 
again the bias of the resulting composite estimator remains 
unknown. The above discussion of two perspectives on 
composite estimation has some similarity with the dual 
property of poststratification in terms of both variance and 
(coverage) bias reduction, see Singh and Folsom (2000). 


3. COMPOSITE ESTIMATORS: NEW AND OLD 


We start with the cross-sectional estimator at time $t$ of 
the total $\tau_y(t)$, denoted gr, which is given by 

$$\hat\tau_{y(gr)}(t) = \sum_{k \in s(t)} y_k(t)\, w_{gr}(t,k), \tag{3.1}$$

$$w_{gr}(t,k) = d(t,k)\Bigl[1 + x_k(t)'\bigl(X(t)'\Lambda(t)X(t)\bigr)^{-1}\bigl(\tau_x(t) - \hat\tau_x(t)\bigr)\Bigr], \tag{3.2}$$

where the $d(t,k)$’s are the initial design weights adjusted for 
nonresponse, $x_k(t)$ is a $p$-vector of covariates used for 
calibration (or poststratification), $X(t)$ is the $n(t) \times p$ matrix of 
x-observations, $n(t)$ is the sample size, $\Lambda(t)$ is 
diag$(d(t,k))$, $\tau_x(t)$ is the known $p$-vector of calibration 
controls, and $\hat\tau_x(t)$ is the corresponding vector of 
expansion estimates based on d-weights. In terms of the 
notation $F_t$, $B_t$ and $B'_t$ of the previous section, $F_t$ here can 
be taken as the gr-estimator (3.1), and $B_t$ is the gr-estimator 
based on nonbirth panels given by 

$$B_t = (1-\gamma)^{-1} \sum_{k \in s(t|t-1)} y_k(t)\, w_{gr}(t,k), \tag{3.3}$$

where $s(t|t-1)$ is the subsample at $t$ matched with the 
sample at $t-1$. The estimator $B'_t$ is also a gr-estimator, and 
is given by 

$$B'_t = \gamma^{-1} \sum_{k \in s(t) \setminus s(t|t-1)} y_k(t)\, w_{gr}(t,k), \tag{3.4}$$

where the sum is over the subsample defined by the birth 
panel at $t$. 
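
The gr-weights of (3.2) can be computed in a few lines. The sketch below is a minimal illustration using NumPy, assuming a toy sample with two poststrata and made-up design weights and controls; the function name `gr_weights` is hypothetical and this is not the CLFS estimation system.

```python
import numpy as np


def gr_weights(d, X, tau_x):
    """Calibration weights of the form (3.2).

    d     : design weights d(t, k), length n
    X     : n x p matrix of calibration covariates
    tau_x : known p-vector of calibration controls
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    tau_hat = X.T @ d                    # expansion estimates of the x-totals
    M = X.T @ (d[:, None] * X)           # X' Lambda X with Lambda = diag(d)
    lam = np.linalg.solve(M, np.asarray(tau_x, dtype=float) - tau_hat)
    return d * (1.0 + X @ lam)


# Toy example: four units, two poststrata indicated by the columns of X.
d = [10.0, 10.0, 10.0, 10.0]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
tau_x = [25.0, 18.0]

w = gr_weights(d, X, tau_x)
y = np.array([1.0, 0.0, 1.0, 1.0])
estimate = float(y @ w)                  # gr-estimate of the y-total, as in (3.1)
```

By construction the weighted x-totals `X' w` reproduce the controls `tau_x`, which is the calibration property that the composite estimators discussed below also rely on.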

The ak-composite estimator uses the macro-level past 
information for the new predictors, and can be defined as 

$$\begin{aligned}
C_{t(ak)} &= F_t + k\,(C_{t-1(ak)} - D_{t-1} + B_t - F_t) + a\,(F_t - B_t) \\
          &= F_t + (k - a)\,(C_{t-1(ak)} - D_{t-1} + B_t - F_t) + a\,(C_{t-1(ak)} - D_{t-1}).
\end{aligned} \tag{3.5}$$

Here the coefficients $a$, $k$ for level estimation are 
obtained by optimally regressing $F_t$ on the two predictor 
zero functions, based on the past information, namely, 
$C_{t-1(ak)} - D_{t-1} + B_t - F_t$ and $C_{t-1(ak)} - D_{t-1}$. Thus, $a$, $k$ 
depend on the sample design as well as on the study 
variable $y$; in particular, they are not even the same for level 
and change estimates for the same $y$. For change estimation, 
$F_t - C_{t-1(ak)}$, and not $F_t$, is regressed optimally on the above 
predictors. In practice, $a$, $k$ are estimated by performing a 
grid search on the interval (0,1) such that the variance of $C_t$ 
is minimized. As mentioned earlier, typically $a$ is smaller 
than $k$. In defining the above two new predictor zero 
functions based on past information, two estimators of 
$\tau_y(t-1)$ are first formed: one is $D_{t-1}$, based on the nondeath 
panels at $t-1$ (i.e., the subsample at $t-1$ matched with the 
sample at $t$), and the other is $F_t + (D_{t-1} - B_t)$, which is the 
gr-estimator at time $t$ adjusted for change from $t-1$ to $t$ 
estimated from the common sample. Clearly, if there is no 
overlap in the panel design, then all the predictor zero 
functions become no longer meaningful, resulting in no 
change in $F_t$ by composite estimation. Similarly, if there is 
a complete overlap, then $B_t = F_t$, and again there is no 
effect on $F_t$ of composite estimation. This may at first seem 
counter-intuitive, because the past data ($y_{t-1}$) is correlated 
with the present ($y_t$) due to sample overlap. However, 
complete overlap amounts, in principle, to collecting a 
single sample of multivariate data on $y$ with elements 
corresponding to $y$ at different time points. Using this 
analogy, there is no room for improvement (in the 
design-based framework) as there is no larger sample with 
additional information. In the case of no overlap, additional 
information is there but it does not help as it is uncorrelated. 
Note, however, that at the first stage, psu’s (primary 
sampling units) in the CLFS remain common over several years 
before they are rotated out. Therefore, efficiency gains due 
to partial overlap are realized mainly from the reduction of 
the second stage variance component. 

Furthermore, note that the estimator $C_{t(ak)}$ uses past 
information in the univariate sense in that for the study 
variable $y$, past information about only $y_{t-1}$ is used. If new 
predictors based on several past variables (in addition to 
$y_{t-1}$) are also used for the study variable $y$, then the 
composite estimation becomes multivariate. However, the 
optimal choice of the $(a, k)$ coefficients for the multivariate 
case can be quite cumbersome. 
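
The recursion (3.5) and the grid search over $(a, k)$ can be sketched as follows. This is a minimal Python illustration under stated assumptions: the series is started at $C_0 = F_0$, the list `D_prev[t]` holds $D_{t-1}$ (so its first entry is unused), and the variance criterion passed to `grid_search` is a hypothetical user-supplied function, since the paper does not spell out how the variance is evaluated inside the search.

```python
def ak_series(F, B, D_prev, a, k):
    """ak-composite series following (3.5), started at C_0 = F_0.

    F[t]      : full-sample gr-estimate at t
    B[t]      : estimate at t from the nonbirth (matched) panels
    D_prev[t] : estimate at t-1 from the nondeath (matched) panels
    """
    C = [F[0]]
    for t in range(1, len(F)):
        change_driven = C[-1] - D_prev[t] + B[t] - F[t]
        C.append(F[t] + k * change_driven + a * (F[t] - B[t]))
    return C


def grid_search(variance_of, step=0.1):
    """Return the (a, k) on a grid inside (0, 1) minimizing variance_of(a, k)."""
    n = int(round(1.0 / step))
    grid = [round(i * step, 10) for i in range(1, n)]
    return min(((a, k) for a in grid for k in grid),
               key=lambda pair: variance_of(*pair))


# Made-up two-month series; with a = 0 the k-composite estimator results.
F, B, D_prev = [100.0, 101.0], [0.0, 100.0], [0.0, 98.0]
C_k = ak_series(F, B, D_prev, a=0.0, k=0.5)

# Hypothetical smooth variance surface, used only to exercise the search.
best = grid_search(lambda a, k: (a - 0.2) ** 2 + (k - 0.7) ** 2)
```

With $a = 0$ the second value of `C_k` equals $(1-k)F_t + k(C_{t-1} + B_t - D_{t-1})$, the familiar k-composite form. In practice the criterion would be a variance estimate computed from past data for each candidate pair.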

The rc-class of estimators is given by 

$$C_{t(rc)} = F_t + b_{t(rc)}\bigl(C^*_{t-1(rc)} - D^*_{t-1} + B_t - F_t\bigr) + a_{t(rc)}\bigl(C^*_{t-1(rc)} - D^*_{t-1}\bigr), \tag{3.6}$$

where $C^*_{t-1(rc)}$ denotes the $t-1$ estimator for the study 
variable ($y$) after the $(t-1)$-calibration weights are further 
calibrated to meet the controls used for poststratification by 
gr at time $t$. Thus $C^*_{t-1(rc)}$ is an estimate of the population 
total at $t$ for the $y$-variable at $t-1$. The starred $D^*_{t-1}$ 
signifies that it is based on the subsample at $t$ matched with 
the sample at $t-1$, but uses the gr-weights at $t$ as the $y$ 
values from $t-1$ are augmented to the sample at $t$ by 
micro-matching. (Note that the estimator $D^*_{t-1}$ involves, in 
general, imputed values, and may suffer from bias due to 
imputation. For a diagnosis and adjustment for this bias, see 
section 6.) The coefficients $b_{t(rc)}$ and $a_{t(rc)}$ are computed 
similarly to gr of (3.1); see below for more details. These 
coefficients are suboptimal unlike $(a, k)$. However, like 
$(a, k)$, they are $y$-specific, and in the multivariate case 
they depend on the key set of study variables chosen from 
the past for new controls, but they can be computed easily as 
they are suboptimal in nature. Thus with rc-estimation, it is 
fairly easy to introduce more predictors. The predictors 
$(C^*_{t-1(rc)} - D^*_{t-1})$ and $(C^*_{t-1(rc)} - D^*_{t-1} + B_t - F_t)$ can be termed 
respectively as level-driven and change-driven as in Singh, 
Kennedy, Wu and Brisebois (1997). The reason for this is 
not only that the former is a difference of two level 
estimates, and the latter a difference of two change estimates, 
$(C^*_{t-1(rc)} - F_t)$ and $(D^*_{t-1} - B_t)$, but also that the former tends to 
provide high efficiency gains in level estimation over what 
can be obtained in the presence of the latter, and similarly, 
the latter provides high efficiency gains in change 
estimation over what can be achieved in the presence of the 
former. 

The idea of using the micro-level past information for the 
new predictors in rc-estimation can be applied to the 
ak-estimator, and thus a new estimator ak* can be proposed: 

$$C_{t(ak^*)} = F_t + (k^* - a^*)\bigl(C^*_{t-1(ak^*)} - D^*_{t-1} + B_t - F_t\bigr) + a^*\bigl(C^*_{t-1(ak^*)} - D^*_{t-1}\bigr). \tag{3.7}$$

The control $C^*_{t-1(ak^*)}$ denotes the $(t-1)$ calibration 
estimator for $y$ after the ak*-composite weights are further 
calibrated to meet the controls used for poststratification by 
gr at $t$. (Here the ak*-composite weights are similar to the 
ak-composite weights of Fuller (1990) where the composite 
estimators for a set of key y-variables serve as additional 
controls in the usual gr to obtain a set of final calibration 
weights. This allows for the ak-composite estimator to be 
computed as a calibration estimator.) The main differences 
between the various estimators defined above lie in the 
definition of regression coefficients (optimal vs. suboptimal), 
and that of the predictors (macro-level vs. micro-level use 
of past information). Special cases of the above composite 
estimators can be obtained as described in Singh, et al. 
(1997) by using only one of the two predictors. For $C_{t(ak)}$, 
if $a = 0$ (i.e., only the change-driven predictor is used), we get 
the well known k-composite estimator which can be termed 
as the ak2-estimator in the present context. If $a = k$, i.e., 
only the level-driven predictor is used, we get a new composite 
estimator $C_{t(ak1)}$, which can be termed as the ak1-estimator. 
Similarly for $C_{t(ak^*)}$, we get two more new composite 
estimators ak*1 and ak*2. For $C_{t(rc)}$, with only the 
level-driven predictor, we get the rc1-estimator, termed earlier as 
MR1 in Singh and Merkouris (1995). With only 
change-driven predictors, we get the rc2-estimator termed earlier as 
MR2 in Singh, et al. (1997). 

As mentioned earlier, the rc-estimator is computed as a 
gr-estimator of (3.1), and therefore, it can be expressed as 
$\hat\tau_{y(rc)}(t) = \sum_{k \in s(t)} y_k(t)\, w_{rc}(t,k)$. The $X(t)$-matrix is 
expanded to an $n(t) \times (p + 2q)$ matrix $X(t)^*$ where $2q$ 
represents the number of new predictors, the factor 2 
signifying the pair of level-driven and change-driven predictors. 
The (random) control totals $C^*_{t-1(rc)}$ corresponding to the 
key set of y-variables from $t-1$ selected for composite 
estimation are treated as fixed (during the computation of 
regression coefficients) like the other (nonrandom) 
gr-controls. Now, since the level-driven predictor can be 
written as 

$$D^*_{t-1} = (1-\gamma)^{-1} \sum_{k \in s(t|t-1)} y_k(t-1)\, w_{gr}(t,k) = \sum_{k \in s(t)} (1-\gamma)^{-1}\, y_k(t-1)\, 1_{k \in s(t|t-1)}\, w_{gr}(t,k), \tag{3.8}$$

the column of the $X(t)^*$-matrix corresponding to this 
predictor consists of the $n(t)$-values of $(1-\gamma)^{-1}\, y_k(t-1)\, 1_{k \in s(t|t-1)}$. 
Similarly the change-driven predictor can be written as 

$$F_t + D^*_{t-1} - B_t = \sum_{k \in s(t)} \Bigl[y_k(t) + (1-\gamma)^{-1}\bigl(y_k(t-1) - y_k(t)\bigr)\, 1_{k \in s(t|t-1)}\Bigr] w_{gr}(t,k), \tag{3.9}$$

and the corresponding column of the $X(t)^*$ matrix consists 
of the $n(t)$-values of $y_k(t) + (1-\gamma)^{-1}\bigl(y_k(t-1) - y_k(t)\bigr)\, 1_{k \in s(t|t-1)}$. 
Once the $X(t)^*$ matrix is defined, the gr-system can be used 
to compute the calibration weights $w_{rc}(t,k)$ as in (3.2). 
Note that the calibration weights $w_{rc}(t,k)$ can be used for 
estimation of all study variables although they depend 
explicitly only on the key set of study variables chosen for 
the new predictors from correlated past information. Also 
note that although the rc-estimator of (3.6) was defined as 
the gr-estimator plus regression-adjustments for the new 
predictors, computationally it is convenient to perform a 
gr-calibration on the design weights when all the old and new 
calibration controls are considered simultaneously. This 
way computation for the multivariate rc-estimator is not 
much different from the univariate rc-estimator. 
Alternatively, one could compute the rc-estimator as an adjusted gr 
as in (3.6), but the coefficients for the new predictors would 
be partial regression coefficients, and therefore do not have 
the standard form of the gr-coefficients. 
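
The construction of the extra columns of $X(t)^*$ and the subsequent calibration can be sketched in Python with NumPy. The example below is an assumption-laden toy: six units with one birth unit, a single population control, one key y-variable and only the change-driven column (an rc2-type set-up); the helper names `rc_columns` and `calibrate` and all numbers are illustrative.

```python
import numpy as np


def rc_columns(y_now, y_prev, matched, gamma):
    """Level- and change-driven columns of X(t)* following (3.8) and (3.9)."""
    y_now = np.asarray(y_now, dtype=float)
    matched = np.asarray(matched, dtype=bool)
    # Units outside s(t|t-1) carry no previous-month value.
    y_prev = np.where(matched, np.asarray(y_prev, dtype=float), 0.0)
    scale = 1.0 / (1.0 - gamma)
    level = scale * y_prev * matched
    change = y_now + scale * (y_prev - y_now) * matched
    return level, change


def calibrate(d, X, controls):
    """gr-type calibration of design weights d to the given controls."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    gap = np.asarray(controls, dtype=float) - X.T @ d
    lam = np.linalg.solve(X.T @ (d[:, None] * X), gap)
    return d * (1.0 + X @ lam)


gamma = 1.0 / 6.0
matched = [True, True, True, True, True, False]   # last unit is in the birth panel
y_now = [1, 0, 1, 1, 0, 1]
y_prev = [1, 0, 0, 1, 0, float("nan")]            # unknown for the birth unit

level, change = rc_columns(y_now, y_prev, matched, gamma)

# X(t)*: population-count column plus the change-driven column.
X_star = np.column_stack([np.ones(6), change])
controls = [60.0, 38.0]          # population size and a made-up C_{t-1} control
w_rc = calibrate(np.full(6, 10.0), X_star, controls)
```

After calibration the weighted total of the change-driven column equals the previous month's composite control, while the population control continues to be met. The same weights `w_rc` can then be applied to any study variable.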

Finally we note that with composite estimation, one 
would expect higher efficiency gains for change estimates 
($C_t - C_{t-1}$ vs. $F_t - F_{t-1}$) than those for level estimates 
($C_t$ vs. $F_t$). To see this, consider a simple identity: 
$V(F_t - F_{t-1}) = V(F_t) + V(F_{t-1}) - 2\,\mathrm{Cov}(F_t, F_{t-1})$. 
Typically $V(F_t) \approx V(F_{t-1}) = \sigma^2_F$ (say), then the above can be 
reduced to $V(F_t - F_{t-1}) \approx 2\sigma^2_F(1 - \rho_F)$, where $\rho_F$ is the 
correlation between $F_t$ and $F_{t-1}$. Similarly, 
$V(C_t - C_{t-1}) \approx 2\sigma^2_C(1 - \rho_C)$. Thus the change efficiency is 
approximately the level efficiency times $(1 - \rho_F)/(1 - \rho_C)$. 
It follows that if the new predictors for composite 
estimation increase considerably the (positive) correlation 
between $C_t$ and $C_{t-1}$, then the change efficiency will highly 
dominate the level efficiency. 
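
The approximate relation between change and level efficiencies can be written as a small Python helper. The function names and the numbers used are illustrative assumptions and are not taken from the tables of this paper.

```python
def change_variance(sigma2, rho):
    """Approximate variance of a month-to-month change: 2 * sigma^2 * (1 - rho)."""
    return 2.0 * sigma2 * (1.0 - rho)


def change_efficiency(level_eff, rho_f, rho_c):
    """Change efficiency implied by the level efficiency and the two correlations."""
    return level_eff * (1.0 - rho_f) / (1.0 - rho_c)


# Made-up inputs: level efficiency 1.2, correlations 0.8 (gr) and 0.9 (composite).
sigma2_f, sigma2_c = 1.2, 1.0            # so that the level efficiency is 1.2
direct = change_variance(sigma2_f, 0.8) / change_variance(sigma2_c, 0.9)
implied = change_efficiency(sigma2_f / sigma2_c, 0.8, 0.9)
```

Here `direct` and `implied` agree: a modest level gain of 1.2 turns into a change efficiency of about 2.4 once the month-to-month correlation rises from 0.8 to 0.9.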


4. VARIANCE ESTIMATION 


The CLFS currently uses delete-one psu jackknifing to 
find variance of the gr-estimate. The method of jackknifing 
is valid (for cross-sectional surveys) if the psu-level 
estimates have identical mean and variance, and the psu 
selection can be treated as with replacement. When psu 
selection is without replacement the variance estimate 
becomes conservative if the (common) covariance between 
the psu-level estimates is negative. This is generally the 
case. For repeated surveys, a third condition that psu’s are 
common (or connected) over time is needed. When this is 
the case the survey can be viewed as cross-sectional by 
treating the vector of observations (psu-level estimates) 
over time as a single observation collected at the conceptual 
end point in time. In the rotating panel design of the CLFS, 
psu’s are not rotated out for a number of years, but the 
within psu units are rotated every six months. Each psu in 
the CLFS corresponds to a single panel which is either birth 
or non-birth. Note that to meet the conditions of 
jackknifing, it is not necessary that the same set of units be 
used to obtain psu-level estimates. The condition that psu- 
level estimates have common mean and variance within a 
stratum is reasonable on the grounds that the panel 
estimates have common mean and variance. For composite 
estimation, although birth and non-birth panels are treated 
differently, panel-level composite estimates should have 
identical mean and variance unconditionally on the panel 
assignment. This is so because the panels are assigned at 
random; a panel could have been birth with probability 
$\gamma = 1/6$ and non-birth with probability $1 - \gamma = 5/6$. The 
resulting unconditional variance estimate will not be 
smaller than the one obtained conditionally on the panel 
assignment. Thus the method of jackknifing is expected to 
provide a conservative variance estimate in the CLFS 
context. Note that the above considerations for measures of 
uncertainty do not involve rotation group bias that may be 
present. 
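
For a linear estimator, the delete-one psu jackknife can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it works from made-up psu-level weighted totals within strata and does not recompute the calibration weights for each replicate, which a production system would do; the function name is hypothetical.

```python
def jackknife_variance(psu_totals_by_stratum):
    """Delete-one-psu jackknife variance of an estimated total.

    `psu_totals_by_stratum` is a list of strata, each a list of psu-level
    weighted totals (at least two psu's per stratum).
    """
    full = sum(sum(z) for z in psu_totals_by_stratum)
    variance = 0.0
    for z in psu_totals_by_stratum:
        n = len(z)
        stratum_total = sum(z)
        for z_j in z:
            # Drop psu j and reweight the remaining psu's of the stratum by n/(n-1).
            replicate = full - stratum_total + (stratum_total - z_j) * n / (n - 1)
            variance += (n - 1) / n * (replicate - full) ** 2
    return variance


# Made-up psu totals for two strata.
v = jackknife_variance([[10.0, 14.0], [5.0, 5.0, 5.0]])
```

For such a linear estimator the result reduces to the sum over strata of $n_h/(n_h - 1)$ times the sum of squared deviations of the psu totals from their stratum mean, so the second stratum above contributes nothing.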


5. EVALUATION RESULTS 


The numerical results are based on 1996 Ontario CLFS 
data, see Singh, et al. (1997). The auxiliary variables for gr 
are population counts corresponding to 16 age-sex groups, 
11 economic regions, 10 census metropolitan areas, and 6 
panels. Each panel control specifies 1/6 of the 15+ 
population. The new controls (30 in all) for rc corres- 
ponding to only change-driven predictors are: employed, 
unemployed and not in the labour force by age (young and 
old) by sex groups for a total of 12, employment by industry 
categories for a total of 16, and 2 employment by full/part 
time categories. In fact, these 30 new controls reduce to 
only 28 because of linear dependence. The multivariate rc- 
estimator involves these 28 extra controls, while the uni- 
variate rc involves just one extra control. The average 
relative efficiency shown in various tables is computed as 
the average variance of gr over 12 months of 1996 divided 
by the average variance of the composite estimator over 12 
months. 
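
The average relative efficiency just defined is a ratio of averages rather than an average of monthly ratios. A minimal Python sketch, with a hypothetical function name and made-up variances, makes the distinction explicit.

```python
def average_relative_efficiency(var_gr, var_composite):
    """Average gr variance divided by average composite variance."""
    mean_gr = sum(var_gr) / len(var_gr)
    mean_composite = sum(var_composite) / len(var_composite)
    return mean_gr / mean_composite


# Made-up monthly variances for two months.
eff = average_relative_efficiency([4.0, 6.0], [2.0, 4.0])
mean_of_ratios = (4.0 / 2.0 + 6.0 / 4.0) / 2
```

In this toy case `eff` is 5/3 while `mean_of_ratios` is 1.75, so the two summaries need not coincide.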


5.1 Macro-level vs. Micro-level Predictors 


For level-estimates, the correlation is computed between 
the current month level estimate (i.e., $F_t$) and the predictor 
(e.g., the level-driven $C_{t-1} - D_{t-1}$ at the macro-level), 
whereas for the change estimate, it is computed between 
$F_t - C_{t-1}$ and the predictors. The correlation is negative as 
expected because the estimate involving common panels is 
positively correlated with $F_t$ but expressed with a negative 
sign in the predictor. Recall that the composite estimator 
used is the ak with macro-level and ak* with micro-level 
predictors. 

It is seen from Table 1 for the four key variables 
(employed, unemployed, employed in Trade, and employed 
in Transportation and Communication (TRCO)), for each 
of the level-driven and change-driven predictors, micro- 
level predictors outperform macro-level in terms of high 
correlation. 

Between level- and change-driven predictors at the 
micro-level, change-driven is seen to out-perform level- 
driven. Similar results hold for other key variables. In view 
of these correlations, other evaluation results shown below 
pertain to only ak2, ak*2, and rc2 versions of composite 
estimates. The rc-estimator with both level- and change- 
driven predictors was not included in the interest of keeping 
down the number of extra controls. 


5.2 ak vs. ak* vs. rc (Efficiencies Relative to gr ) 


Table 2 shows the optimal coefficients (e.g., k for ak2 
estimator) and the corresponding relative efficiency over gr. 
The optimal coefficients were found via grid-search using 
the same 1996 data. (In practice, this should be based on 
past data). It is seen that the efficiency gains can be 
considerable as one moves from ak to ak*. The optimal 
coefficients vary for level and change estimates. The last 
two columns under each of level and change estimates show 
the reduction in efficiency if level-optimal coefficients are 
used for change estimates and vice-versa. Level-optimal 
coefficients seem to perform quite well for change 
estimates, in contrast to a drop in efficiency of level 
estimates when change-optimal coefficients are used. 


Table 1 
Average Monthly Correlation between Composite Predictor and Estimates for Level and Change (Ontario, 1996) 

                             Level                              Change 
               Level-Driven     Change-Driven      Level-Driven     Change-Driven 
               Predictors       Predictors         Predictors       Predictors 
Variable       Macro   Micro    Macro   Micro      Macro   Micro    Macro   Micro 
Employed       -0.27   -0.35    -0.23   -0.45      -0.35   -0.49    -0.57   -0.84 
Unemployed     -0.26   -0.35    -0.24   -0.33      -0.22   -0.40    -0.39   -0.53 
Empl. Trade    -0.58   -0.55    -0.58   -0.65      -0.65   -0.73    -0.91   -0.96 
Empl. TRCO     -0.58   -0.55    -0.60   -0.68      -0.63   -0.70    -0.92   -0.96 

Table 2 
Average Relative Efficiency of ak and ak* over gr (Ontario, 1996) 

                          Coeff                 Eff (Level)     Eff (Change)              Eff (Level)     Eff (Change) 
               Level Optimal  Change Optimal   Level Optimal   Change Optimal            Change Optimal  Level Optimal 
Variable       ak     ak*     ak     ak*       ak     ak*      ak           ak*          ak*             ak* 
Employed       0.42   0.72    0.48   0.95      1.05   1.25     1.28         2.43         0.72            [illegible] 
Unemployed     0.40   0.50    0.54   0.69      1.06   [illegible]  [illegible]  [illegible]  1.05        1.26 
Empl. Trade    0.79   0.84    0.95   0.98      1.43   1.67     2.36         4.97         0.88            4.22 
Empl. TRCO     0.84   0.87    0.95   0.98      1.59   1.88     [illegible]  [illegible]  [illegible]     6.51 

Table 3 compares rc (univariate and multivariate) with 
ak*. The possible values of the $b_{t(rc)}$ coefficients over the 12 
month-period for the univariate rc2 are summarized via 
mean, minimum and maximum. They can be compared 
with the corresponding optimal coefficients for ak*. The rc- 
coefficients seem to provide a compromise and lie some- 
where between level-optimal and change-optimal coeffi- 
cient values. The rc-efficiencies for the change estimate are 
quite at par with those for ak* but for level estimates, are 
somewhat lower. The efficiency gains at the aggregate level 
for which gr had controls are low but are high for domains 
without gr-controls. 

Table 4 presents possible loss in efficiencies for 
estimates obtained as residuals in ak*-estimation in the 
interest of internal consistency. It shows that caution should 
be exercised in practice when choosing variables for 
residual estimation or using compromise coefficient values 
in ak*-estimation of components of an aggregate. 


5.3 Change vs. Level Efficiencies of rc Over gr 


Table 5 shows that the approximate relation (see Section 
3) between change and level efficiencies holds fairly well. 
It is seen that the month-to-month correlation for rc-estimates 
for domains not having a corresponding population control 
in gr can be quite high compared to the correlation for gr. 
This, in turn, yields a high factor by which change 
efficiency exceeds level efficiency. 
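
The relation checked in Table 5 can be illustrated with a short calculation. This is only a sketch: it assumes the usual approximation V(change) = 2 V(level)(1 - correlation) for each estimator, which is our reading of the relation referred to above, and it takes the unsubscripted correlation in Table 5 to be the month-to-month correlation for gr. The function name is hypothetical, and the inputs are the rounded "Employed" values printed in Table 5, so the two numbers agree only approximately.

```python
def change_to_level_ratio(rho_gr, rho_rc):
    """Approximate ratio of change efficiency to level efficiency (rc over gr).

    Assumes V(change) ~= 2 * V(level) * (1 - rho) for both estimators, so the
    level variances cancel and only the month-to-month correlations remain.
    """
    return (1.0 - rho_gr) / (1.0 - rho_rc)


# Rounded values for "Employed" as printed in Table 5.
observed = 2.46 / 1.05                          # change efficiency / level efficiency
predicted = change_to_level_ratio(0.77, 0.91)   # from the two correlations
print(round(observed, 2), round(predicted, 2))
```

Because the correlations are printed to two decimals, the predicted ratio is sensitive to rounding when the rc correlation is close to one.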


5.4 Point Estimate and SE of Difference Between 
rc and gr 


Table 6 shows monthly estimates (and SEs of level and 
change estimates) for the variable employed in trade at the 
Ontario level, for gr and rc. The corresponding values for 
the monthly difference (rc - gr) are also shown. It is seen 
that the differences between rc and gr are not significant in 
general. Efficiencies (not shown here) of annual average 
and quarterly estimates of rc and gr were also computed. As 
expected, due to serial correlation, there may be a loss in 
efficiency over gr. However, in terms of the coefficient of 
variation, this is likely to be of no practical consequence. 


5.5 Time Series of Level Estimates 


Figures 1(a) and (b) show level estimates of employment 
for Ontario for the period 1988-96 for gr and rc, without and 
with seasonal adjustment. (The X11-ARIMA method was 
used.) Figures 2(a) and (b) show employment for the 
industry group “Trade”. At the provincial level, aggregated 
over the industry group, there is similarity between gr and 
rc (seasonally adjusted or not) series because the gr- 
estimates have high precision to begin with. At the domain 


Table 3 
Average Relative Efficiency of rc over gr (Ontario, 1996) 

                         Coeff                          Eff (Level)                   Eff (Change) 
              rc-univariate 
              (level or change)      ak*             rc            rc              rc            rc 
Variable      Avg    Min    Max    Level  Change   (univariate)  (multivariate)  (univariate)  (multivariate) 
Employed      0.88   0.81   0.90   0.72   0.95     1.05          1.05            2.39          2.46 
Unemployed    0.60   0.53   0.65   0.50   0.69     …             1.12            1.31          1.33 
Empl. Trade   0.96   0.94   0.98   0.84   0.98     …             1.22            4.98          5.07 
Empl. TRCO    0.95   0.93   0.97   0.87   0.98     …             1.42            7.47          7.54 

Table 4 
Average Relative Efficiency of ak* and rc over gr from Ontario, 1996 (Regular vs. Residual) 

                                     Level                              Change 
Variable                 ak* Coeff  Eff (ak*)  Eff (rc)     ak* Coeff  Eff (ak*)  Eff (rc) 
Agriculture (regular)    0.91       …          2.32         0.97       4.88       5.22 
Agriculture (residual)   NA         0.63       2.32         NA         3.90       5.22 
NILF (regular)           0.74       1.26       1.07         0.95       1.96       2.01 
NILF (residual)          NA         1.21       1.07         NA         1.95       2.01 

Table 5 
Relation Between Level and Change Efficiencies for rc (multivariate) over gr (Ontario, 1996) 

Variable      Change Eff  Level Eff  Change Eff/Level Eff  (1-ρ)/(1-ρ_rc)  ρ      ρ_rc 
Employed      2.46        1.05       2.34                  2.65            0.77   0.91 
Unemployed    1.33        1.12       1.19                  …               0.50   0.59 
Empl. Trade   5.07        1.22       4.16                  3.80            0.79   0.95 
Empl. TRCO    7.54        1.42       5.31                  …               …      … 

Table 6 
Monthly Point Estimates for gr and rc and Their Differences (Ontario, 1996) 
(Level and Change for Employment in Trade, Ontario, 1996) 

Month       Type     gr              rc              rc-gr 
January     Level    886.5 (21.0)    858.9 (17.3)    -27.6 (23.0) 
            Change   -25.8 (13.2)    -21.0 (5.6)     4.8 (11.4) 
February    Level    906.5 (22.9)    867.9 (17.6)    -38.6 (24.6) 
            Change   20.0 (14.2)     9.0 (4.7)       -11.0 (12.5) 
March       Level    927.1 (20.8)    874.1 (18.3)    -52.9 (23.1) 
            Change   20.6 (13.3)     6.2 (4.7)       -14.4 (12.5) 
April       Level    914.8 (20.3)    872.5 (…)       -42.3 (22.4) 
            Change   -12.3 (13.4)    -1.6 (5.1)      10.7 (12.5) 
May         Level    912.8 (18.9)    887.6 (17.0)    -25.1 (21.8) 
            Change   -2.1 (13.0)     15.1 (5.7)      17.2 (11.6) 
June        Level    908.1 (17.8)    888.6 (…)       -19.5 (21.5) 
            Change   -4.7 (12.3)     0.9 (4.9)       5.6 (11.9) 
July        Level    899.9 (18.1)    881.2 (17.7)    -18.7 (23.0) 
            Change   -8.2 (12.8)     -7.4 (6.7)      0.8 (10.7) 
August      Level    913.9 (16.9)    888.1 (18.3)    -25.8 (22.6) 
            Change   14.0 (11.5)     6.9 (5.3)       -7.1 (10.3) 
September   Level    886.6 (20.4)    876.4 (19.7)    -10.2 (23.1) 
            Change   -27.3 (12.6)    -11.8 (6.3)     15.6 (11.1) 
October     Level    898.6 (22.9)    889.3 (19.3)    -9.3 (26.1) 
            Change   12.1 (13.4)     12.9 (6.6)      0.9 (11.8) 
November    Level    911.2 (20.3)    902.3 (19.3)    -8.9 (25.9) 
            Change   12.6 (13.9)     13.0 (7.0)      0.4 (12.6) 
December    Level    917.9 (20.5)    916.3 (19.0)    -1.5 (26.0) 
            Change   6.7 (…)         14.0 (6.1)      … (10.9) 


Note: SEs are shown in parentheses. 


level defined by Trade, however, the series are quite 
different. (Note that among the numerous series that were 
examined, this particular series was chosen here to depict 
the extreme scenario for gaps between the gr and rc series. For 
almost all other series, the two series crossed each other 
fairly often.) Since the gr-series is highly volatile, there is 
room for considerable smoothing by rc. Also note that, 
because of the expected high signal-to-noise ratio, the seasonally 
adjusted rc series at the Trade-domain level looks 
considerably smoother than that for the gr-series; in fact, there is 
very little difference between the gr-series with and without 
seasonal adjustment. It is also observed that there tend to be 
runs of consecutive periods where rc is either larger or 
smaller than gr. This is expected because of the high values of 
the rc-coefficients (Table 3) and the high serial correlation 
in both series (see Table 5). Interestingly, turning points in 
the gr and rc series tend to occur at (approximately) the same 
time points, though they appear somewhat dampened with 
rc due to the higher serial correlation in the rc-series. It may be 
noted that the gap between the two series would have been 
smaller if controls for level-driven predictors were also 
included. 


6. CONCLUDING REMARKS 


The previously used gr-estimator in CLFS showed insta- 
bility in change estimates and various domain level 
estimates. The rc-estimator provides smoother estimate 
series (which, in turn, renders change estimates more 
stable). The rc-method departs from the traditional ak- 
composite estimation in several ways, the main points being 
the use of micro-matching for collection of unit-level past 
information for common panels, and the use of regression 
calibration (like gr) to produce a set of final weights for use 
with all study variables. Three versions of rc were 
examined. Although this paper was mainly concerned with 
rc2, i.e., with change-driven predictors (because of the 
desired resulting smoothness in estimate series), it was 
found (although not reported here) that level estimates of 
some key variables can be further improved (in comparison 
to rc2) by including corresponding level-driven predictors. 
Thus, in practice, a good strategy might be to use a mixture 
of mostly change-driven and some level-driven predictors. 

The version of the rc-estimator currently implemented 
for CLFS was suggested by Fuller (1999), and can be 
expressed as 

where a is prescribed (1/3, say, but in general could be 
y-specific), and the regression coefficient is computed using the 
gr-system as in the rc-class of estimates. A simple interpretation 
of (6.1) can be obtained by comparing with the ak*-estimator 
of (3.7). First write (3.7) as 

Now, for (6.1), a can be roughly viewed as the ratio of 
the two optimal coefficients a*, k*, and the factor k* 
outside the square brackets of (6.2) is replaced by the 
(suboptimal) regression coefficient. Thus (6.1) is 
not equivalent to the optimal ak*-estimator, but some 
optimality could be preserved (if a is made y-specific) in 
setting the relative contribution of change and level driven 
predictors. Note, however, that the problem of internal 
inconsistency as mentioned in the introduction might arise 
if a is y-specific. Other attractive features of this version 
are that the value of a can be chosen to be well bounded 
away from zero (this should help to avoid sustained gaps 
between the gr and rc series), and the number of extra controls 
is not doubled when both level and change driven 
predictors are included, thus allowing for introducing more 
controls as well as more degrees of freedom in variance 
estimation. 

Figure 1(a) Employment in Ontario, actual 
Figure 1(b) Employment, Ontario, seasonally adjusted 
Figure 2(a) Employment in Trade, Ontario, actual 
Figure 2(b) Employment in Trade, Ontario, seasonally adjusted 

As a diagnostic of the impact of bias due to imputation 
of the previous month’s employment status in view of the 
nonresponse of some of the present month respondents, the 
following simple check can be performed. The basic idea is 
to compute a multiplicative bias adjustment factor for the 
estimator that involves imputed values. The factor is 
defined as the ratio of two gr-estimators of the previous 
month’s characteristic based on the matched subsample. 
The denominator is a gr-estimator for the previous month 
(involving imputed values) while the numerator is a 
gr-estimator for the previous month (not involving imputed 
values), both computed in a somewhat nonstandard way. 
For the numerator, we use the time t-1 respondents with 
their time t-1 responses and, after nonresponse adjustment 
of the design weights, construct the gr-estimator with 
controls for time t. For the denominator, we assume that the 
subsets of each of the matched subsamples at t-1 and t 
(here the matching is done with respect to each other, one 
forward in time and the other backward) not having the 
counterpart because of nonresponse are statistically 
exchangeable with respect to each other. We then replace 
the time t-1 respondents who did not respond at time t by 
the time t-1 nonrespondents who responded at t, along 
with their imputed time t-1 responses as well as design 
weights. Now the nonresponse and gr-poststratification 
(with controls for t) weight adjustments are redone for this 
modified full sample at t-1. The gr-weights so obtained 
are used to compute the denominator mentioned above. One 
can now look at the time series of this factor over several 
months for diagnostics on the bias due to imputation. If the 
factor is not deemed close to one, then its average 
over several months can be treated as a nonrandom 
multiplicative bias adjustment to the estimator. In practice, instead 
of adjusting the estimator, it would be preferable computationally 
to adjust the new control (of equation 3.6) for the 
corresponding characteristic by the inverse of the above 
multiplicative factor. Alternatively, the need for imputation can 
be avoided altogether if the questionnaire can be modified 
to obtain the necessary past information as suggested in the 
introduction. 
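
The last steps of this check (forming the monthly factor, averaging it, and applying its inverse to a control) can be sketched as follows. The two gr-estimates per month are taken as given, since their construction depends on the survey's weighting system; all function names and numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean


def imputation_bias_factors(without_imputation, with_imputation):
    """Monthly ratios: gr-estimate without imputed values over the one with them."""
    return [num / den for num, den in zip(without_imputation, with_imputation)]


def adjusted_control(control, factors):
    """Divide a control total by the average factor (i.e., apply its inverse)."""
    return control / mean(factors)


# Hypothetical previous-month estimates (thousands) for three months.
numerators = [905.0, 910.0, 898.0]      # not involving imputed values
denominators = [900.0, 903.0, 893.0]    # involving imputed values
factors = imputation_bias_factors(numerators, denominators)
print([round(f, 4) for f in factors])
print(round(adjusted_control(880.0, factors), 1))
```

Inspecting the printed series of factors corresponds to the time-series diagnostic described above; the adjustment is only of interest when the factors differ persistently from one.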

The study of Lent, Miller and Cantwell (1994, 1996) 
considers the ak-composite weighted estimator for the U.S. 
Current Population Survey as an alternative to the currently 
used ak-estimator with a=0.2, k=0.4. Based on our 
experience with ak*, the ak*-composite weighted estimator 
might be a better alternative in the interest of efficiency gains. 


A Regression Composite Estimator with Application 
to the Canadian Labour Force Survey 


WAYNE A. FULLER and J.N.K. RAO 


ABSTRACT 


The Canadian Labour Force Survey is a monthly survey of households selected according to a stratified multistage design. 
The sample of households is divided into six panels (rotation groups). A panel remains in the sample for six consecutive 
months and is then dropped from the sample. In the past, a generalized regression estimator, based only on the current 
month’s data, has been implemented with a regression weights program. In this paper, we study regression composite 
estimation procedures that make use of sample information from previous periods and that can be implemented with a 
regression weights program. Singh (1996) proposed a composite estimator, called MR2, which can be computed by adding 
x-variables to the current regression weights program. Singh’s estimator is considerably more efficient than the generalized 
regression estimator for one-period change, but not for current level. Also, the estimator of level can deviate from that of 
the generalized regression estimator by a substantial amount and this deviation can persist over a long period. We propose 
a “compromise” estimator, using a regression weights program and the same number of x-variables as MR2, that is more 
efficient for both level and change than the generalized regression estimator based only on the current month data. The 
proposed estimator also addresses the drift problem and is applicable to other surveys that employ rotation sampling. 


KEY WORDS: Survey sampling; Rotating samples; Combining estimators. 


1. INTRODUCTION 


Composite estimation is a term used in survey sampling 
to describe estimators for a current period that use infor- 
mation from previous periods of a periodic survey with a 
rotating design. When some units are observed in some of 
the periods, but not in all periods, it is possible to use this 
fact to improve estimates for all time periods. 

Statistics Canada, U.S. Bureau of the Census and some 
other statistical agencies use a rotating design for labour 
force surveys. The current Canadian Labour Force Survey 
(LFS) is a monthly survey of about 59,000 households, 
which are selected according to a stratified multistage 
sampling design. The ultimate sampling unit is the house- 
hold and a sample of households is divided into six panels 
(rotation groups). A rotation group remains in the sample 
for six consecutive months and is then dropped from the 
sample completely. Thus five-sixths of the sample of 
households is common between two consecutive months. 
Singh, Drew, Gambino and Mayda (1990) and Gambino, 
Singh, Dufour, Kennedy and Lindeyer (1998) contain 
detailed descriptions of the LFS design. In the U.S. Current 
Population Survey (CPS), the sample is composed of eight 
rotation groups. A rotation group stays in the sample for 
four consecutive months, leaves the sample for the 
succeeding eight months, and then returns for another four 
consecutive months. It is then dropped from the sample 
completely. Thus there is a 75 percent month-to-month 
sample overlap and a 50 percent year-to-year sample 
overlap (Hansen, Hurwitz, Nisselson and Steinberg 1955). 
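
The overlap fractions quoted above follow directly from the rotation patterns. The sketch below assumes one new rotation group enters each month and that all groups are of equal size; the function and constant names are ours, not part of either survey's documentation.

```python
def groups_in_sample(month, in_sample_offsets):
    """Entry months of the rotation groups interviewed in `month`.

    `in_sample_offsets` lists the months since entry (0 = first interview)
    at which a group is interviewed; one new group enters every month.
    """
    return {month - offset for offset in in_sample_offsets}


def overlap_fraction(in_sample_offsets, lag):
    """Share of the groups in sample at month 0 that are also in sample at month `lag`."""
    now = groups_in_sample(0, in_sample_offsets)
    later = groups_in_sample(lag, in_sample_offsets)
    return len(now & later) / len(now)


LFS_PATTERN = list(range(6))                         # six consecutive months
CPS_PATTERN = list(range(4)) + list(range(12, 16))   # 4 in, 8 out, 4 in

print(overlap_fraction(LFS_PATTERN, 1))    # LFS, month-to-month
print(overlap_fraction(CPS_PATTERN, 1))    # CPS, month-to-month
print(overlap_fraction(CPS_PATTERN, 12))   # CPS, year-to-year
```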

Patterson (1950), following the initial work by Jessen 
(1942), provided the theoretical foundations for design and 


estimation for repeated surveys, using generalized least 
squares procedures. For the CPS, Hansen et al. (1955) 
proposed a simpler estimator, called the K-composite 
estimator. Gurney and Daly (1965) presented an improvement 
to the K-composite estimator, called the AK-composite 
estimator with two weighting factors A and K. Breau and Ernst 
(1983) compared alternative estimators to the K-composite 
estimator for the CPS. Rao and Graham (1964) studied opti- 
mal replacement schemes for the K-composite estimator. 
Eckler (1955) and Wolter (1979) studied two-level rotation 
schemes such as the one used in the U.S. Retail Trade 
Survey. Yansaneh and Fuller (1998) studied optimal recur- 
sive estimation for repeated surveys. Fuller (1990) and 
Lent, Miller, Cantwell and Duff (1999) developed the 
method of composite weights for the CPS. The composite 
weights are obtained by raking the design weights to 
specified control totals that included population totals of 
auxiliary variables and K-composite estimates for charac- 
teristics of interest, y. Using the composite weights, users 
can generate estimates from microdata files for the current 
month without recourse to data from previous months. 

The above authors used the traditional design-based 
approach, assuming the unknown totals on each occasion to 
be fixed parameters. Other authors (Scott, Smith and Jones 
1977; Jones 1980; Binder and Dick 1989; Bell and Hillmer 
1990; Tiller 1989 and Pfeffermann 1991) developed esti- 
mates for repeated surveys under the assumption that the 
underlying true values constitute a realization of a time 
series. 

Statistics Canada considered K and AK composite 
estimation for the Labour Force Survey at several times 
during the past 25 years (Kumar and Lee 1983), but did not 
adopt composite estimation. Instead, a generalized regression 
estimator, based only on the current month’s data, has 
been computed with a regression weights program. When 
composite estimation was considered in the 1990s, there 
was strong pressure to develop a composite estimation 
procedure that used the existing estimation program. Singh 
(1996) proposed an ingenious method, called Modified 
Regression (MR), to address this issue. This method leads 
to a composite estimator, called the MR2 estimator, which uses 
the existing regression weights program. Singh suggested 
creating x-variables to be used as control variables in the 
regression program. With the created variables and the 
previous period estimator, the existing regression weights 
program is used to construct regression weights that define 
the estimator for the current period. Control variables with 
known population totals are also included. 

An empirical study of the MR2 estimator identified 
several characteristics of the procedure. First, the estimated 
variance of a one-period change is much reduced. Second, 
the estimated variance of level is often similar to that for the 
direct estimator. Third, the estimator of level could deviate 
from the direct estimator by a substantial amount and this 
deviation could extend over a long period. 

In this paper, we study the efficiency of MR estimators 
theoretically, under a simplified set-up. We propose also a 
“compromise” estimator that leads to significant gains in 
efficiency, for both level and change, over the estimator 
using only the current month’s data. The composite 
estimator also addresses the “drift” problem mentioned 
above and can be implemented using the existing regression 
weights program. Gambino, Kennedy and Singh (2000) 
evaluated the efficiency of the composite estimates for the 
LFS data, using a jackknife method of variance estimation. 
Bell (2000) compared several composite estimators using 
data from the Australian Labour Force Survey. 


2. COMPOSITE REGRESSION ESTIMATION 


There are two types of observations used in composite 
estimation: those observed only at the current time, t, and 
those observed both at the current time and at the previous 
time, t-1. Sometimes information in previous observations 
is condensed in the estimate(s) for the previous period(s). 
Let $w_i$ be the sampling weight for observation i at time t, let 
$A_t$ be the set of elements with observations at both the time 
periods t and t-1, and let $B_t$ be the set of elements 
observed only at the current time period t. In this initial 
context, i is the index for an individual respondent. If there 
is no nonresponse, the set $A_t$ for the LFS is composed of 
individuals in the five panels that were in the sample during 
the previous period, called the overlap panels. With no 
nonresponse, the set $B_t$ for the LFS contains individuals 
first observed in the current period, called the birth panel. 
Assume 

$$\sum_{i \in A_t} w_i + \sum_{i \in B_t} w_i = \hat{N}_t = \text{estimated population total}.$$

Let $\theta_t$ be the fraction of the sample in the overlap at time t, 

$$\theta_t = \hat{N}_t^{-1} \sum_{i \in A_t} w_i. \qquad (2.1)$$

In the Labour Force Survey $\theta_t$ is about 5/6 and is nearly 
constant over time. We will frequently omit the subscript t 
on $A_t$, $B_t$ and $\theta_t$, for simplicity. 


2.1 Estimator 
Singh’s (1996) MR2 estimator uses the control variable 

$$x_{1,i,t} = \begin{cases} \theta_t^{-1}(y_{i,t-1} - y_{i,t}) + y_{i,t} & \text{if } i \in A_t, \\ y_{i,t} & \text{if } i \in B_t, \end{cases} \qquad (2.2)$$

in the regression program, where $y_{i,t}$ is the value of a 
characteristic of interest, y, for element i at time t. Because 
of nonresponse in the LFS, Singh’s original proposal used 
imputation for missing data and set $\theta$ = 5/6, after imputation 
for missing data. In our initial discussion we use the $\theta_t$ as 
defined in (2.1), assuming no nonresponse so that imputation 
is not required. Note that “micromatching” of individual 
data files at t-1 and t is needed to calculate $x_{1,i,t}$ and 
the resulting MR2 estimator. Additional control variables of 
the form (2.2) associated with other y-variables as well as 
auxiliary variables with known population totals are also 
included in the regression estimation. The auxiliary 
variables in the LFS include demographic variables such as 
age, sex and location. 

The particular x-variable in (2.2) is designed such that 
the estimated total of $x_1$ is an estimator of the previous 
period total of y. Thus, the control total for $x_1$ in the 
regression procedure is the previous period estimator of the 
total of y. 
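
A small numerical check may help fix ideas about (2.2). The sketch below reads (2.2) as "previous value minus current value, divided by the overlap fraction, plus the current value" for overlap respondents, and the current value for birth-panel respondents. It then confirms that the weighted mean of the created variable equals the current overall mean plus the change in the matched-panel means, which is why it estimates the previous-period mean. The toy data and all names are hypothetical.

```python
def mr2_control_variable(y_prev, y_curr, in_overlap, theta):
    """Value of the control variable for one respondent, following our reading of (2.2).

    y_prev is not used for birth-panel respondents (in_overlap is False).
    """
    if in_overlap:
        return (y_prev - y_curr) / theta + y_curr
    return y_curr


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy sample: (y at t-1, y at t, in overlap?, weight).
sample = [
    (1.0, 1.0, True, 2.0),
    (0.0, 1.0, True, 1.0),
    (1.0, 0.0, True, 3.0),
    (None, 1.0, False, 2.0),
]
weights = [w for _, _, _, w in sample]
theta = sum(w for _, _, a, w in sample if a) / sum(weights)   # overlap fraction

x1 = [mr2_control_variable(p, c, a, theta) for p, c, a, _ in sample]
overlap = [(p, c, w) for p, c, a, w in sample if a]
overlap_weights = [w for _, _, w in overlap]
ybar_t = weighted_mean([c for _, c, _, _ in sample], weights)
ybar_m_prev = weighted_mean([p for p, _, _ in overlap], overlap_weights)
ybar_m_curr = weighted_mean([c for _, c, _ in overlap], overlap_weights)

print(weighted_mean(x1, weights), ybar_t + ybar_m_prev - ybar_m_curr)
```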

Let $\hat{\mu}_{t-1}$ be the estimator of the mean of y for period 
t-1, let $\bar{y}_{m,t-1}$ and $\bar{y}_{m,t}$ be the means of the matched panels 
at time t-1 and t respectively, let $\bar{y}_t$ be the grand mean of 
all sample panels at time t, and let $\bar{y}_{B,t}$ be the mean of the 
birth panel at time t. Assume the sample of size n is divided 
into g panels of equal size and denote the matched sampling 
fraction by $\theta$. To simplify the discussion we consider a 
single y-variable. Then Singh’s (1996) MR2 estimator at 
time t, constructed with $x_{1,t}$, can be written in a regression 
estimator form as 

$$\hat{\mu}_t = \bar{y}_t + (\bar{X}_{c,t} - \bar{x}_{c,t})\hat{\beta}_c + [\hat{\mu}_{t-1} - (\bar{y}_{m,t-1} - \bar{y}_{m,t} + \bar{y}_t)]\,\hat{b}_1, \qquad (2.3)$$

where $\bar{X}_{c,t}$ is the population mean of the vector of auxiliary 
variables, such as age and sex, at time t, $\bar{x}_{c,t}$ is the weighted 
sample mean of the auxiliary variables, and $(\hat{\beta}_c', \hat{b}_1)'$ is the 
vector of regression coefficients for the regression of $y_t$ on 
$(x_{c,t}, x_{1,t})$. 
One can write 

$$y_{i,t} = \hat{y}_{i(t)} + d_{i(t)},$$


Survey Methodology, June 2001 


where $\hat{y}_{i(t)}$ is the predicted value in the regression of $y_{i,t}$ 
on $x_{c,i,t}$ and $d_{i(t)}$ is the deviation from the regression 
predicted value. Then 

$$x_{1,i,t} = \begin{cases} \theta^{-1}(\hat{y}_{i(t-1)} + d_{i(t-1)} - \hat{y}_{i(t)} - d_{i(t)}) + \hat{y}_{i(t)} + d_{i(t)} & \text{if } i \in A, \\ \hat{y}_{i(t)} + d_{i(t)} & \text{if } i \in B. \end{cases}$$

For demographic variables $x_{c,i,t}$, it is reasonable to believe 
that $\hat{y}_{i(t-1)}$ is close to $\hat{y}_{i(t)}$. Therefore the part of $x_{1,i,t}$ 
that is orthogonal to $x_{c,i,t}$ is close to 

$$x_{d1,i,t} = \begin{cases} \theta^{-1}(d_{i(t-1)} - d_{i(t)}) + d_{i(t)} & \text{if } i \in A, \\ d_{i(t)} & \text{if } i \in B. \end{cases}$$

Thus the partial regression coefficient $\hat{b}_1$ is close to the 
regression coefficient for the regression of $d_{i(t)}$ on $x_{d1,i,t}$, 
and the value depends on the correlation between $d_{i(t)}$ and 
$d_{i(t-1)}$. A simple model for $d_{i(t)}$ that has been used in 
the past, and the one we adopt in our analysis, is the 
assumption that $d_{i(t)}$ is the sum of a fixed $\mu_t$ and an 
error that is a first order autoregression with parameter $\rho$. 

To simplify the presentation, we discuss the simple 
random sampling model without $x_c$. The results extend to 
the general case by considering the parameter $\rho$ to be the 
partial correlation between $y_t$ and $y_{t-1}$ after adjusting for 
$x_c$. 

Under the autoregressive model with fixed $\mu_t$, an 
intercept and no other x-variable in the model, it can be shown 
that $\hat{b}_1$ converges in probability to 

$$b_0 = \operatorname{plim}\, \hat{b}_1 = \theta\rho\,[2 - \theta - 2(1-\theta)\rho + (1-\theta)\sigma_y^{-2}\Delta_t^2]^{-1},$$

where $\Delta_t^2 = (\mu_t - \mu_{t-1})^2$. Assuming $(1-\theta)\sigma_y^{-2}\Delta_t^2$ is small 
relative to the other terms we get 

$$b_0 = \theta\rho\,[2 - \theta - 2(1-\theta)\rho]^{-1}. \qquad (2.4)$$

For the LFS, $b_0 = (7 - 2\rho)^{-1}\,5\rho$. 
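
The limiting coefficient is easy to tabulate. The sketch below implements (2.4) as we read it, namely the overlap fraction times the correlation, divided by 2 minus the overlap fraction minus twice (1 minus the overlap fraction) times the correlation. It also checks that an overlap fraction of 5/6 reproduces the LFS special case. The function names are ours.

```python
def limiting_coefficient(theta, rho):
    """Limiting value b_0 of the MR2 regression coefficient, per our reading of (2.4)."""
    return theta * rho / (2.0 - theta - 2.0 * (1.0 - theta) * rho)


def limiting_coefficient_lfs(rho):
    """Special case theta = 5/6: b_0 = 5 * rho / (7 - 2 * rho)."""
    return 5.0 * rho / (7.0 - 2.0 * rho)


for rho in (0.70, 0.80, 0.90, 0.95, 0.98):
    print(rho, round(limiting_coefficient(5.0 / 6.0, rho), 4))
```

The coefficient stays below the correlation for every correlation in (0, 1), and approaches it as the correlation approaches one.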

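Alternative representations for the estimator $\hat{\mu}_t$, omitting 
$x_{c,t}$, are obtained using the formula $\bar{y}_t = \theta\bar{y}_{m,t} + (1-\theta)\bar{y}_{B,t}$. Thus 

$$\begin{aligned}
\hat{\mu}_t &= (1-\hat{b}_1)\bar{y}_t + \hat{b}_1[\hat{\mu}_{t-1} + (\bar{y}_{m,t} - \bar{y}_{m,t-1})] \\
&= [\theta + (1-\theta)\hat{b}_1]\bar{y}_{m,t} + \hat{b}_1(\hat{\mu}_{t-1} - \bar{y}_{m,t-1}) + (1-\theta)(1-\hat{b}_1)\bar{y}_{B,t} \\
&= [\theta + (1-\theta)\hat{b}_1][\bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})b^*] + (1-\theta)(1-\hat{b}_1)\bar{y}_{B,t} \\
&= \lambda_1[\bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})b^*] + (1-\lambda_1)\bar{y}_{B,t}, \qquad (2.5)
\end{aligned}$$

where 

$$1 - \lambda_1 = (1-\theta)(1-\hat{b}_1)$$

and 

$$b^* = [\theta + (1-\theta)\hat{b}_1]^{-1}\hat{b}_1. \qquad (2.6)$$

The first expression on the right of the equality of (2.5) 
gives the MR2 estimator as a linear combination of the 
direct estimator $\bar{y}_t$ and the difference estimator 
$\hat{\mu}_{t-1} + (\bar{y}_{m,t} - \bar{y}_{m,t-1})$, in the form of a composite estimator. The 
final expression of (2.5) gives the estimator as a linear 
combination of a “regression-type” estimator based on the 
overlap panels and the mean of the birth panels. 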
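
The equivalence of the first and final expressions of (2.5) can be verified numerically. The sketch below takes (2.6) to define the weight on the overlap part as 1 minus (1 minus the overlap fraction) times (1 minus the coefficient), and the rescaled coefficient as the coefficient divided by (overlap fraction plus (1 minus overlap fraction) times the coefficient). The means used are hypothetical and serve only to exercise the algebra; the function names are ours.

```python
def mr2_class_form(theta, b1):
    """Return (lambda_1, b_star) as in our reading of (2.6)."""
    lambda_1 = 1.0 - (1.0 - theta) * (1.0 - b1)
    b_star = b1 / (theta + (1.0 - theta) * b1)
    return lambda_1, b_star


def mr2_first_form(ybar_t, ybar_m_prev, ybar_m_curr, mu_prev, b1):
    """First expression in (2.5): direct estimator combined with the difference estimator."""
    return (1.0 - b1) * ybar_t + b1 * (mu_prev + ybar_m_curr - ybar_m_prev)


def mr2_last_form(ybar_b, ybar_m_prev, ybar_m_curr, mu_prev, theta, b1):
    """Final expression in (2.5): overlap regression-type estimator plus birth-panel mean."""
    lambda_1, b_star = mr2_class_form(theta, b1)
    overlap_part = ybar_m_curr + (mu_prev - ybar_m_prev) * b_star
    return lambda_1 * overlap_part + (1.0 - lambda_1) * ybar_b


# Hypothetical means, chosen only to exercise the algebra.
theta, b1 = 5.0 / 6.0, 0.9
ybar_m_prev, ybar_m_curr, ybar_b, mu_prev = 0.60, 0.62, 0.58, 0.61
ybar_t = theta * ybar_m_curr + (1.0 - theta) * ybar_b
first = mr2_first_form(ybar_t, ybar_m_prev, ybar_m_curr, mu_prev, b1)
last = mr2_last_form(ybar_b, ybar_m_prev, ybar_m_curr, mu_prev, theta, b1)
print(first, last)
```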


2.2 An Alternative Estimator 


It is possible to define alternative regression variables to 
use in regression composite estimation. We present a 
particular regression variable in this subsection. The 
associated regression estimator is not suggested as the 
ultimate estimator, but the estimator is a member of a class 
for which efficiency calculations are given. An alternative 
to Singh’s (1996) MR2 estimator is outlined in section 5. 

Define a variable to be equal to the previous period value 
if the individual is in the overlap sample and to be equal to 
the estimated mean for the previous period if the individual 
is in the birth sample. The regression variable is 

$$x_{2,i,t} = \begin{cases} y_{i,t-1} & \text{if } i \in A_t, \\ \hat{\mu}_{t-1} & \text{if } i \in B_t. \end{cases} \qquad (2.7)$$

If this variable is used in a regression estimator, the control 
mean is $\hat{\mu}_{t-1}$, the previous period estimator, because the 
mean for the created variable is estimating the mean for 
period t-1. Singh (1996) used a variable similar to 
$x_{2,i,t}$. In Singh’s variable, the $\hat{\mu}_{t-1}$ in (2.7) is $\bar{y}_{m,t-1}$ if $i \in B_t$. 

Consider a regression estimator constructed with $x_{2,t}$ 
and recall that the control mean of $x_{2,t}$ is $\hat{\mu}_{t-1}$. The 
regression estimator using $x_{2,t}$ can be written 

$$\hat{\mu}_t = \bar{y}_t + (\hat{\mu}_{t-1} - \bar{x}_{2,t})\hat{\beta}, \qquad (2.8)$$

where $\hat{\beta}$ is the regression coefficient for the regression of $y_t$ 
on $x_{2,t}$ (subscript t is dropped on $\hat{\beta}$ for simplicity), $\bar{y}_t$ is the 
sample mean of y at time t, and $\bar{x}_{2,t}$ is the sample mean of $x_{2,t}$ 
for all sample panels at time t-1. The regression coefficient 
$\hat{\beta}$ is, approximately, the regression of $y_t$ on $x_{2,t}$ in the 
set A. The coefficient is not exactly the regression coefficient 
for the set A because $\bar{y}_{m,t-1}$ is not equal to $\hat{\mu}_{t-1}$, but 
the difference between the two estimators will usually be 
small. Singh (1996) called the regression estimator 
constructed with his variable the MR1 estimator. 
Using $\bar{y}_t = \theta\bar{y}_{m,t} + (1-\theta)\bar{y}_{B,t}$, the regression estimator of $\mu_t$ 
using $x_{2,t}$ as a control variable is given by 

$$\hat{\mu}_t = \theta\{\bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})\hat{\beta}\} + (1-\theta)\bar{y}_{B,t}. \qquad (2.9)$$

The expression within curly brackets in (2.9) is the 
regression estimator of $\mu_t$ using the estimator $\hat{\mu}_{t-1}$ and only 
the data from the matched sample A. Note that the 
regression estimator 

$$\hat{\mu}_{A,t} = \bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})\beta, \qquad (2.10)$$

where $\beta$ is the regression of $y_t$ on $y_{t-1}$ in the set A, is the 
optimal linear estimator for $\mu_t$ based on $\hat{\mu}_{t-1}$ and the data 
of set A. Note that $\beta = \rho$ if the variances are the same at the 
two time periods. Hereafter, we often set $\beta = \rho$. 

Using the variable $x_{2,t}$ gives the optimal estimator, $\hat{\mu}_{A,t}$, 
based on data in set A, but it does not combine that 
estimator with the mean of set B in an optimal way. As can 
be seen in (2.9), the weight given to the mean of set B is 
$1-\theta$. In general, this weight is too large because the 
variance of the regression estimator is less than the variance 
of the simple mean. 


3. OPTIMAL ESTIMATION 


The way in which one chooses to combine the regression 
estimator for set A with the mean of set B depends on one’s 
objective function and on the variance of $\hat{\mu}_{t-1}$. We give 
some illustrative calculations based on some simplifying 
assumptions. For convenience let $V\{\hat{\mu}_{t-1}\}$ be expressed as 
a multiple of the variance of the birth panel, 

$$V\{\hat{\mu}_{t-1}\} = q_{t-1}^{-1} V\{\bar{y}_{B,t}\}. \qquad (3.1)$$

Assume 

$$V\{\bar{y}_t\} = g^{-1} V\{\bar{y}_{B,t}\}, \qquad (3.2)$$

$$\mathrm{Cov}\{\hat{\mu}_{t-1},\ \bar{y}_{m,t} - \bar{y}_{m,t-1}\beta\} = 0, \qquad (3.3)$$

$$\mathrm{Cov}\{\hat{\mu}_{t-1},\ \bar{y}_{B,t}\} = 0, \qquad (3.4)$$

and 

$$\mathrm{Cov}\{\bar{y}_{B,t},\ \bar{y}_{m,t} - \bar{y}_{m,t-1}\beta\} = 0, \qquad (3.5)$$

where g is the number of rotation groups (panels). 
Assumption (3.1) is reasonable if the original panels have 
a covariance function well approximated by that of a first 
order autoregressive process. For the LFS, the zero 
covariances in (3.4) and (3.5), and assumption (3.2) are 
only approximations because $\bar{y}_{B,t}$ is not based on an entirely 
independent sample. 

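We write the regression estimator based on the overlap 
as 

$$\hat{\mu}_{A,t} = \bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})\beta$$

and, with the assumptions, obtain 

$$V\{\hat{\mu}_{A,t}\} = [g^{-1}\theta^{-1}(1-\rho^2) + q_{t-1}^{-1}\rho^2]\,V\{\bar{y}_{B,t}\}. \qquad (3.6)$$

For the LFS, g = 6 is the number of panels. Now consider 
an estimator that is a linear combination of $\hat{\mu}_{A,t}$ and $\bar{y}_{B,t}$, 

$$\hat{\mu}_t = \lambda\hat{\mu}_{A,t} + (1-\lambda)\bar{y}_{B,t} = \lambda[\bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})\beta] + (1-\lambda)\bar{y}_{B,t}, \qquad (3.7)$$

where $0 < \lambda < 1$ is to be determined. To minimize the 
variance of current level, given $\hat{\mu}_{t-1}$ with variance 
$q_{t-1}^{-1}V\{\bar{y}_{B,t}\}$, one would minimize 

$$V\{\hat{\mu}_t\} = V\{\lambda\hat{\mu}_{A,t} + (1-\lambda)\bar{y}_{B,t}\} = \lambda^2 V\{\hat{\mu}_{A,t}\} + (1-\lambda)^2 V\{\bar{y}_{B,t}\} \qquad (3.8)$$

with respect to $\lambda$. Under the assumptions (3.3), (3.4) and 
(3.5), the optimum $\lambda$ for current level is 

$$\lambda_{opt} = [g^{-1}\theta^{-1}(1-\rho^2) + q_{t-1}^{-1}\rho^2 + 1]^{-1}.$$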

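However, if one is planning on using the estimator for a 
long period of time, one must realize that only certain 
values of $q_t$ are possible in the long run. The value of $\lambda$ 
chosen to estimate $\mu_t$ determines the variance of $\hat{\mu}_t$ and 
hence determines the variance that will go into the 
estimator of $\mu_{t+1}$. Assuming $\beta = \rho$, we have 

$$V\{\hat{\mu}_t\} = \{g^{-1}\theta^{-1}\lambda^2(1-\rho^2) + q_{t-1}^{-1}\lambda^2\rho^2 + (1-\lambda)^2\}\,V\{\bar{y}_{B,t}\}$$

or 

$$q_t^{-1} = g^{-1}\theta^{-1}\lambda^2(1-\rho^2) + (1-\lambda)^2 + \lambda^2\rho^2 q_{t-1}^{-1}. \qquad (3.9)$$

Thus, for a given $\lambda$, the limiting value for $q_t^{-1}$ is 

$$\lim_{t \to \infty} q_t^{-1} = (1-\lambda^2\rho^2)^{-1}[g^{-1}\theta^{-1}\lambda^2(1-\rho^2) + (1-\lambda)^2]. \qquad (3.10)$$

This result is equivalent to that given by Cochran (1977), 
page 352 equation (12.86). 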

Table 1 contains values of the limit variances as the 
number of periods becomes large, for selected values of ρ 
and λ, where θ = 5/6 and gθ = 5 for the LFS. The 
variances are standardized so that the variance of the direct 
estimator based on the mean of six panels is 1.00. Thus, the 
entries are six times the limiting value in (3.10). If the 
correlation is 0.95 and λ is set equal to 0.96, the long run 
variance of current level is 70% of that of the direct 
estimator. If λ is set equal to 0.90, the long run variance is 
58% of that of the direct estimator when ρ = 0.95. 
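
The entries of Table 1 can be recomputed from the limiting value in (3.10). The sketch below uses our reading of (3.10): the limiting variance, in units of the birth-panel variance, is the weight squared times (1 minus the squared correlation), divided by the number of overlap panels, plus (1 minus the weight) squared, all divided by 1 minus the product of the squared weight and squared correlation. It is then multiplied by the number of panels to standardize. The function name and default arguments are ours.

```python
def limit_level_variance(lam, rho, g=6, theta=5.0 / 6.0):
    """Limiting variance of level from our reading of (3.10), scaled so that
    the direct estimator based on g panels has variance 1."""
    q_inv = (lam**2 * (1.0 - rho**2) / (g * theta) + (1.0 - lam) ** 2) / (
        1.0 - lam**2 * rho**2
    )
    return g * q_inv


for lam in (0.90, 0.96):
    print(lam, round(limit_level_variance(lam, 0.95), 3))
```

With a correlation of 0.95 the two printed values correspond to the 58% and 70% figures quoted above.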

The first line in Table 1 is for λ = 5/6 = θ. This is the λ 
corresponding to the use of $x_{2,t}$ in a regression estimator. 
The variance with λ = 5/6 is always smaller than that of the 
direct estimator because of the improvement associated 
with the use of the regression estimator $\hat{\mu}_{A,t}$. Thus, if ρ ≠ 0, 
the regression estimator with $x_{2,t}$ leads to significant 
reduction in variance over the direct estimator, $\bar{y}_t$, that uses 
current data only. 


Table 1 
Standardized Limit Variances of Level: 
LFS Rotation Pattern 

                         ρ 
λ        0.70    0.80    0.90    0.95    0.98 
0.833    0.897   0.840   0.743   0.665   0.600 
0.840    0.895   0.836   0.734   0.650   0.581 
0.860    0.894   0.830   0.714   0.614   0.527 
0.880    0.903   0.835   0.705   0.588   0.481 
0.900    0.921   0.851   0.711   0.575   0.444 
0.920    0.951   0.882   0.736   0.582   0.420 
0.940    0.992   0.928   0.785   0.617   0.420 
0.960    1.046   0.994   0.867   0.698   0.465 
0.980    1.115   1.083   0.997   0.861   0.619 
0.990    1.155   1.138   1.087   0.998   0.803 
0.995    1.177   1.168   1.140   1.089   0.960 


The optimal λ is a function of ρ and increases slowly as 
ρ increases. For ρ = 0.0, the optimal λ is 0.833, for ρ = 0.7 
the optimal λ is about 0.85, for ρ = 0.95 the optimal λ is 
about 0.91 and for ρ = 0.98 the optimal λ is about 0.93. 
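
The optimal weights quoted above can be approximated by a simple grid search over the long-run variance. This is a sketch under the same reading of (3.10) as before; the grid step, the function names and the default arguments are our choices, and a closed-form or finer search would serve equally well.

```python
def limit_level_variance(lam, rho, g=6, theta=5.0 / 6.0):
    """Standardized limiting variance of level, from our reading of (3.10)."""
    q_inv = (lam**2 * (1.0 - rho**2) / (g * theta) + (1.0 - lam) ** 2) / (
        1.0 - lam**2 * rho**2
    )
    return g * q_inv


def optimal_lambda(rho, step=0.001):
    """Grid search for the weight in (0, 1) that minimizes the limiting variance."""
    grid = [k * step for k in range(1, int(round(1.0 / step)))]
    return min(grid, key=lambda lam: limit_level_variance(lam, rho))


for rho in (0.0, 0.70, 0.95, 0.98):
    print(rho, round(optimal_lambda(rho), 3))
```

The long-run variance is fairly flat near its minimum, so weights within about 0.01 of the optimum give nearly the same efficiency.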

We now turn to the MR2 estimator (2.3), which can be 
written as 

$$\hat{\mu}_t = \lambda_1[\bar{y}_{m,t} + (\hat{\mu}_{t-1} - \bar{y}_{m,t-1})b^*] + (1-\lambda_1)\bar{y}_{B,t},$$

where $\lambda_1$ and $b^*$ are defined in (2.6). While the MR2 
estimator is not a member of the class (3.7), to the degree that $b^*$ 
is “close to” ρ, it is “close to” a member of the class. For 
example, if ρ = 0.95, then $b_0$ = 0.9314 and $b^*$ = 0.9422. If 
ρ = 0.90, then $b_0$ = 0.8654 and $b^*$ = 0.8853. 

Using the limiting value $b_2$ of $b$, we have $(1 - \lambda_2) =
(1 - \theta)(1 - b_2)$, where $b_2$ is given by (2.4). Then $\lambda_2$ =
0.9375, 0.9568, 0.9776, 0.9886, and 0.9954 for $\rho$ = 0.70,
0.80, 0.90, 0.95 and 0.98, respectively. From Table 1, the
standardized variances of $\hat{\mu}_t$ for these values of $\lambda_2$ are
0.986, 0.982, 0.978, 0.976, and 0.975, for $\rho$ = 0.70, 0.80,
0.90, 0.95, and 0.98, respectively. Thus, the MR2 estimator
for current level has an efficiency for level that is essentially
the same as that of the direct estimator, $\bar{y}_t$. The efficiency
of the MR1 estimator is that for $\lambda = 0.833$ in Table 1 and is
always superior to that of $\bar{y}_t$.
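
The quoted values of $\lambda_2$ can be checked numerically. The sketch below assumes the limiting coefficient has the form $b_2 = 5\rho/(7 - 2\rho)$, which is the expression quoted for the LFS rotation pattern in Section 5, and solves $(1 - \lambda_2) = (1 - \theta)(1 - b_2)$ for $\lambda_2$. The function names are illustrative only.

```python
THETA = 5 / 6  # overlap fraction theta for the LFS rotation pattern


def b_limit(rho):
    """Limiting coefficient, assuming the form b = 5*rho / (7 - 2*rho)."""
    return 5 * rho / (7 - 2 * rho)


def lambda_mr2(rho, theta=THETA):
    """Solve (1 - lambda) = (1 - theta) * (1 - b) for lambda."""
    return 1 - (1 - theta) * (1 - b_limit(rho))


for rho in (0.70, 0.80, 0.90, 0.95, 0.98):
    print(f"rho={rho:.2f}  b={b_limit(rho):.4f}  lambda={lambda_mr2(rho):.4f}")
```

Under this assumption the printed $\lambda$ values round to 0.9375, 0.9568, 0.9776, 0.9886 and 0.9954, matching the values in the text.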


4. VARIANCE OF ONE-PERIOD CHANGE 


Given fi,_15 You 1-19 Ym, ANd Vp , the optimal estimator of _; 
is no longer fi,_, because y,, , contains information about 1,_,. 
However, it is not customary practice to revise the estimator
of $\mu_{t-1}$. Given no revision, the estimator of change is

    $\hat{\Delta}_t = \hat{\mu}_t - \hat{\mu}_{t-1}$.

Under no revision in $\hat{\mu}_{t-1}$ and conditions (3.2) through
(3.5), the variance of $\hat{\mu}_t - \hat{\mu}_{t-1}$, where $\hat{\mu}_t$ is defined in (3.7),
is


Vio, Siler} z V{ALY, + (0,_, ~X, ,) Pp] 
sa ¢! A) 9g By} 
alee Os (hop yc a). 


=X Dade IVs fa (41) 

Table 2 contains standardized limit variances of the 
estimated change, $\hat{\mu}_t - \hat{\mu}_{t-1}$, for selected values of $\rho$ and $\lambda$,
with $g\theta = 5$. The entries in the table are the limiting
variances of estimated change divided by the variance of 
change based on the common elements, V { Neg pais 
The variance of change based on the common elements is 
20°1(1-6)(1 -p) V{ip,,}- 


Table 2
Standardized Limit Variances of No-Revision
One Period Change: LFS Rotation Pattern

                          $\rho$
$\lambda$      0.70    0.80    0.90    0.95    0.98
0.833          1.039   1.168   1.550   2.312   4.595
0.840          1.024   1.142   1.492   2.189   4.277
0.860          0.989   1.079   1.345   1.872   3.454
0.880          0.963   1.029   1.223   1.607   2.756
0.900          0.947   0.993   1.127   1.391   2.181
0.920          0.940   0.970   1.055   1.222   1.723
0.940          0.942   0.959   1.007   1.100   1.379
0.960          0.953   0.961   0.982   1.024   1.146
0.980          0.972   0.975   0.980   0.991   1.021
0.990          0.985   0.986   0.987   0.990   0.998
0.995          0.992   0.993   0.993   0.994   0.996


Tables 1 and 2 make clear the cost of not revising the
estimate of $\mu_{t-1}$. For example, if $\rho = 0.95$, the variance of
no-revision one period change is minimized with $\lambda = 0.99$,
but the variance of level is minimized with $\lambda = 0.91$. A
compromise value of $\lambda = 0.95$ gives a variance of level that
is about 14 % larger than optimal and a variance of change
that is about 7 % larger than the smallest variance of Table
2.

Using the values of $\lambda_2$ associated with the MR2
estimator, the entries in Table 2 are 0.940, 0.960, 0.979,
0.989, and 0.996 for $\rho$ = 0.70, 0.80, 0.90, 0.95 and 0.98,
respectively. Thus, given no revision, and ignoring the
difference between $b^{*}$ and $\rho$, the MR2 estimator is nearly
optimal as an estimator of change, unlike the MR1
estimator, where the MR1 estimator corresponds to $\lambda$ =
0.833 in Table 2.


5. A COMPROMISE ESTIMATOR


On the basis of Table 2, the efficiency of the MR2
estimator of change for the LFS based on $x_{2t}$, for the no-revision
case, is quite good. The MR1 no-revision estimator
of change based on $x_{1t}$ has relatively poor efficiency
because it is a member of the class (3.7) with $\lambda = \theta =
0.8333$. On the other hand, the MR1 estimator of level
based on $x_{1t}$ is superior to the MR2 estimator based on $x_{2t}$,
and there are members of the class (3.7) that are much
superior to the MR2 estimator of level.

Because the $\lambda$ in the MR2 estimator is relatively large
and the $\lambda$ for the MR1 estimator is relatively small, we can
create approximations to most interesting members of the
class (3.7) as linear combinations of (2.10) and (2.5). Let

    $x_{\alpha t} = \alpha x_{2t} + (1 - \alpha) x_{1t}$,    (5.1)

where $0 \le \alpha \le 1$ is a fixed number. The regression estimator
based on $x_{\alpha t}$ gives an approximation to a member of the
class (3.7) with

    $\lambda = \alpha \lambda_2 + (1 - \alpha)\theta$,    (5.2)

where $\lambda_2$ is defined in (2.6). Thus, if $\rho = 0.95$,

    $\lambda = \alpha (0.9886) + (1 - \alpha)(5/6)$,

for the LFS rotation pattern with $\theta = 5/6$ and
$b_2 = (7 - 2\rho)^{-1} 5\rho$; $\lambda = 0.95$ if $\alpha = 0.75$.

We choose a, to give the desired combination of y, , and 
the “regression estimator” based on observations in set A. 
If one does not revise the estimator of $\mu_{t-1}$, the preferred
combination depends on the relative importance assigned to 
the variance of level and to the variance of change. 
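
As a small worked example of the compromise weight, the sketch below evaluates $\lambda = \alpha\lambda_2 + (1 - \alpha)\theta$ for the values of $\alpha$ considered in this section, taking $\lambda_2 = 0.9886$ (the value quoted for $\rho = 0.95$) and $\theta = 5/6$. The function name is illustrative only.

```python
THETA = 5 / 6  # overlap fraction theta for the LFS rotation pattern


def compromise_lambda(alpha, lambda_mr2, theta=THETA):
    """Weight of the class member approximated by mixing MR2 (alpha=1) and MR1 (alpha=0)."""
    return alpha * lambda_mr2 + (1 - alpha) * theta


for alpha in (1.00, 0.75, 0.65):
    lam = compromise_lambda(alpha, lambda_mr2=0.9886)
    print(f"alpha={alpha:.2f}  lambda={lam:.4f}")
```

With $\alpha = 0.75$ the result is about 0.95, the compromise value discussed with Tables 1 and 2.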

Table 3 gives the variance of the MR2 estimator ($\alpha = 1$)
relative to the variance of the estimator constructed using $\alpha
= 0.75$ and the variance of the estimator constructed using
$\alpha = 0.65$. An entry in Table 3 for $\hat{\mu}_t$ is expression (3.10)
evaluated at $\lambda_2$ of (2.6) and $\rho$, divided by (3.10) evaluated
at $\lambda$ of (5.2) and $\rho$. An entry for $\hat{\mu}_t - \hat{\mu}_{t-1}$ is expression (4.1)
evaluated at $\lambda_2$ of (2.6) and $\rho$, divided by (4.1) evaluated at
$\lambda$ of (5.2) and $\rho$. These are approximations to actual
efficiencies because $\rho$ is used for the coefficient of the $x$-variable. It is
clear from Table 3 that the compromise estimator is slightly
inferior to the MR2 estimator for one-period change, but is
much superior to the MR2 estimator for level. For example,
with $\rho = 0.95$ and $\alpha = 0.65$, the relative efficiency of the
compromise estimator is 1.62 for level and 0.87 for one-period
change.

For larger values of $\rho$, the variance of change is much
smaller than the variance of level. Thus, for $\rho = 0.95$, the
variance of level and of change for $\alpha = 1.00$ are about 1.00
and 0.12, respectively, while the variance of level and of
change for $\alpha = 0.75$ are about 0.67 and 0.13, respectively,
when expressed in common units.

The smaller $\alpha$ has the advantage that the composite
estimator will be closer to the direct estimator. Thus,
potential biases associated with the composite estimator
should be smaller with the smaller $\alpha$.


Table 3
Approximate Efficiencies of Compromise
Estimators Relative to $\alpha = 1$

                                       $\alpha = 0.75$                               $\alpha = 0.65$
$\rho$   $b_2$   $1 - \lambda_2$   $\hat{\mu}_t$   $\hat{\mu}_t - \hat{\mu}_{t-1}$   $\hat{\mu}_t$   $\hat{\mu}_t - \hat{\mu}_{t-1}$
0.70     0.625   0.0625            1.052           0.999                             1.069           0.995
0.80     0.741   0.0432            1.099           0.994                             1.129           0.984
0.90     0.865   0.0224            1.238           0.975                             1.303           0.946
0.95     0.931   0.0114            1.502           0.936                             1.616           0.875
0.98     0.972   0.0046            2.177           0.833                             [illegible]     0.712


6. DRIFT PROBLEM 


As noted in the Introduction, the MR2 estimator could 
deviate from the direct estimator by a substantial amount 
and this deviation could extend over a long period. We now 
illustrate the basis for this phenomenon. We can express the 
deviation of the compromise regression estimator $\hat{\mu}_t$, based
on the variable of (5.1), from the true mean $\mu_t$ as

tal 
fi,-u,=(4p)"(@y - Hy) +> Ap) 


j=0 


[Fry * 1-5) (6.1) 


where [J, is the mean at the initiation of the process and 


ee r vy a H, ~?p Ore el) 

If $\rho$ is close to one and we use $\lambda = 1$, then the error
$\hat{\mu}_t - \mu_t$ behaves roughly as a random walk which can lead
to long periods in which $\hat{\mu}_t - \mu_t$ has the same sign. On the
other hand, if $\alpha < 1$ and $\rho = 1$, then $\lambda < 1$ and the error
$\hat{\mu}_t - \mu_t$ exhibits less drift. For example, if $\alpha = 0.70$, the
correlation between adjacent errors $\hat{\mu}_t - \mu_t$ will be no
greater than 0.95 under assumptions (3.2)-(3.5). For the
MR2 estimator, $\lambda \to 1$ as $\rho \to 1$ and hence the MR2 estimator
can exhibit drift for $\rho$ close to one.
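
The bound of 0.95 can be illustrated with a short calculation. The sketch below assumes, from the leading factor $(\lambda\rho)^t$ in (6.1), that last period's error is carried forward with weight $\lambda\rho$, and that $\lambda = \alpha\lambda_2 + (1 - \alpha)\theta$ with $\theta = 5/6$. The function names are illustrative only.

```python
THETA = 5 / 6  # overlap fraction theta for the LFS rotation pattern


def error_carryover(alpha, rho, lambda_mr2):
    """Weight lambda * rho with which last period's error is carried forward."""
    lam = alpha * lambda_mr2 + (1 - alpha) * THETA
    return lam * rho


def remaining_share(coef, periods):
    """Share of an initial error still present after the given number of periods."""
    return coef ** periods


# Limiting case rho = 1, where lambda_2 tends to 1.
for alpha in (1.0, 0.7):
    coef = error_carryover(alpha, rho=1.0, lambda_mr2=1.0)
    print(f"alpha={alpha:.1f}  carry-over={coef:.3f}  "
          f"share left after 12 periods={remaining_share(coef, 12):.3f}")
```

With $\alpha = 1$ the carry-over weight is one, so an initial error never decays (random-walk behaviour). With $\alpha = 0.70$ the weight is 0.95 and the error fades geometrically.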


7. CONCLUDING REMARKS 


For simplicity, we often assumed simple random
sampling to obtain theoretical results. Similar results hold
for complex designs and additional auxiliary variables, by
considering $\rho$ to be a partial autocorrelation. Also, we used
$x_t$-variables corresponding to only one variable $y$, but
several $y$-variables can be used in constructing the corresponding
$x$-variables for use in regression estimation.
Gambino, Kennedy and Singh (2001) conducted an
empirical study with LFS data using several $x_t$-variables
with a common $\alpha$, and arrived at a compromise $\alpha$ for use in
the LFS.

In section 2.1, we assumed no nonresponse so that
imputation is not required. But in the LFS, nonresponse on
an item $y$ may occur either at time $t - 1$ or at time $t$ or at both
time points. Gambino, Kennedy and Singh (2001) provide
details of the imputation methods actually used in the LFS.


Comparison of Alternative Labour Force Survey Estimators


PHILIP BELL¹


ABSTRACT 


This paper looks at a range of estimators applicable to a regularly repeated household survey with controlled overlap
between successive surveys. The paper shows how the Best Linear Unbiased Estimator (BLUE) based on a fixed window
of time points can be improved by applying the technique of generalised regression. This improved estimator is compared
to the AK estimator of Gurney and Daly (1965) and the modified regression estimator of Singh, Kennedy, Wu and Brisebois
(1997), using data from the Australian Labour Force Survey.


KEY WORDS: Composite estimator; Best linear unbiased estimator; Modified regression; Repeated Surveys. 


1. INTRODUCTION 


This paper looks at a range of estimators applicable to a 
regularly repeated household survey with controlled overlap 
between successive surveys. The common theme of the 
estimators is to use data from previous times to improve 
current estimates, by taking advantage of correlations in the 
overlapping sample. I refer to all such estimators as com- 
posite estimators. 

The estimators are evaluated for use in the Australian 
Labour Force Survey (LFS). In the LFS, overlap is 
controlled by dividing the first-stage sample of geographic 
areas into eight “rotation groups” from which dwellings are 
selected. In each month the same dwellings are selected 
from seven of the rotation groups, while new dwellings are 
selected in the remaining group. The sample consists of 
civilian persons aged 15 years old and over residing in the 
selected dwellings. 

This sample design leads to high overlap of sample 
between two successive months within the seven “matched 
rotation groups”. Using only data from these rotation 
groups rather than the whole sample can decrease the sam- 
pling error on an estimate of month to month movement. 
Composite estimation techniques seek to exploit this to give 
estimates with lower sampling error. 

Section 2 of the paper introduces the Australian LFS and 
its current “generalised regression” estimator. The issue of 
time-in-survey bias (called rotation group bias by Bailar 
1975) is also discussed. 

Section 3 presents the “AK composite” estimator pro- 
posed by Gurney and Daly (1965). This method has been 
used in the US Current Population Survey for many years. 
An extension known as “AK composite weighting” has
been used for the last few years; this was proposed by Fuller 
(1990) and studied by Lent, Miller and Cantwell (1994, 
1996). 

Section 4 presents the “modified regression” method of 
composite estimation (Singh and Merkouris 1995, Singh 
1996). Here I focus on the MR2 estimator of Singh et al.
(1997), which provides the largest reductions in sampling
error. I also present a variant of this method suggested by 
Fuller (1999) for use in the Canadian Labour Force Survey. 

Section 5 presents a “Best Linear Unbiased Estimator” 
(BLUE) based on data from a “window” containing a fixed 
number of successive months. This estimator was originally 
given by Jessen (1942) in the case of 2 occasions. A BLUE 
based on all occasions in a long series appears impractical, 
though a recursive approximation to this was developed by 
Yansaneh and Fuller (1998). This paper improves the fixed 
window BLUE described in Bell (1998) using the technique 
of generalised regression. 

Section 6 gives the results of applying the different 
methods to the estimation of employed persons and unem- 
ployed persons in the LFS. Standard errors are estimated for 
longer-term indicators such as trend and trend movement, 
as well as for estimates of monthly level and its movement. 
Possible biases are explored, as well as evidence of change 
to seasonal patterns. 

I conclude by comparing the advantages and disadvan- 
tages of the different types of estimator for application in 
the LFS. The improved BLUE estimator is found to be 
efficient, and when applied to the LFS is not subject to any 
large bias. 


2. CURRENT ESTIMATES FOR THE LABOUR 
FORCE SURVEY 


2.1 Overview of the LFS 


The LFS has a multistage sample design, the first stage 
being a sample of small geographic areas known as 
“Census collector’s districts” (CDs). A new sample of CDs 
is selected every five years, and the CDs are classified to 
eight “rotation groups”. The dwellings selected from a CD 
remain in the sample for eight surveys, and then are 
replaced by other dwellings from the same CD. This 
replacement of dwellings is known as rotation, with all the 
dwellings in a rotation group being replaced at the same
time. Interviewers seek to collect data for all in-scope
persons in the selected dwellings. 

Of particular interest in the LFS is the person’s labour 
force status — whether they are employed, unemployed or 
not in the labour force. The numbers of persons in each
labour force status, for various categories of person, are key
items to be estimated in the survey. Even more important to
many users of the survey data than these level estimates are
the estimates of movement in the figures between successive
time points. It can be argued that longer-term
indications of the direction of the series are even more
important, e.g., the movement of the X11 trend or of a
similar smoother (Bell 1999).

The sample design ensures that the unconditional
probability of selection $\pi_{ti}$ is known for each sampled
person $i$ at time $t$. This allows a simple estimator for a
population total due to Horvitz and Thompson (1952). If $Y_t$
is the population item to be estimated at time $t$, and $y_{ti}$ is the
same item as reported by the $i$-th unit at time $t$, the Horvitz-Thompson
estimator is

    $\hat{y}_t^{H} = \sum_i w_{ti}\, y_{ti}$    (1)

for $w_{ti} = \pi_{ti}^{-1}$, known as the selection weights.
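
A minimal sketch of the Horvitz-Thompson estimator in (1): each reported value is weighted by the inverse of its selection probability. The sample values and probabilities below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def horvitz_thompson_total(values, selection_probs):
    """Estimate a population total as the sum of y_i / pi_i over sampled units."""
    if len(values) != len(selection_probs):
        raise ValueError("values and selection_probs must have the same length")
    return sum(y / p for y, p in zip(values, selection_probs))


# Hypothetical sample: employment indicator for three persons,
# with selection probabilities 1/4, 1/2 and 1/8 (weights 4, 2 and 8).
employed = [1, 0, 1]
probs = [0.25, 0.5, 0.125]
print(horvitz_thompson_total(employed, probs))  # 4 + 0 + 8 = 12.0
```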


2.2 The Generalised Regression (GR) Estimator 


Generalised regression is a method for adjusting or 
“calibrating” a set of unit weights to add to a set of popula- 
tion attributes known as benchmarks. For a suitable choice 
of benchmarks the resulting weights give an improved esti- 
mate by taking account of externally available information. 

In the LFS we start with the Horvitz-Thompson weights 
and calibrate them to add to demographic benchmarks that 
give numbers of people in the population for 560 poststrata 
(14 geographic regions classified by sex and 20 age 
groups). The weights from a given post-stratum are prorated 
to add to the stratum benchmark. This post-stratified ratio 
estimator is a particular case of the generalised regression 
or GR estimator. 

Let $x_{ti}$ be a row vector of auxiliary variables for unit $i$ at
time $t$, and $\hat{x}_t = \sum_i b_{ti} x_{ti}$ be estimates for the corresponding
row vector of benchmark values $X_t$, based on some initial
weights $b_{ti}$. The GR estimator based on these initial
weights is then given by

    $\hat{y}_t^{G} = \hat{y}_t + (X_t - \hat{x}_t)\hat{\beta}$    (2)

    for $\hat{\beta} = \left(\sum_i b_{ti}\, x_{ti}' x_{ti}\right)^{-1} \sum_i b_{ti}\, x_{ti}' y_{ti}$,    (3)
    i.e., $\hat{y}_t^{G} = \sum_i w_{ti}^{G}\, y_{ti}$ for $w_{ti}^{G} = b_{ti}\left[1 + (X_t - \hat{x}_t)\left(\sum_j b_{tj}\, x_{tj}' x_{tj}\right)^{-1} x_{ti}'\right]$.    (4)

In post-stratified ratio estimation the row vectors $x_{ti}$
contain zeroes except in the column corresponding to the
unit’s post-stratum, and $b_{ti}$ are the selection weights $w_{ti}$. In
this case the regression parameters are just the post-stratum
means, estimated using the selection weights.
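
The post-stratified special case can be sketched directly, without any matrix algebra: the weights in each post-stratum are prorated so that they add to the stratum benchmark. The data, stratum labels and function names below are hypothetical and only illustrate the prorating step.

```python
from collections import defaultdict


def poststratify(weights, strata, benchmarks):
    """Prorate initial weights so each post-stratum's weights add to its benchmark."""
    totals = defaultdict(float)
    for w, s in zip(weights, strata):
        totals[s] += w
    return [w * benchmarks[s] / totals[s] for w, s in zip(weights, strata)]


def weighted_total(weights, values):
    """Weighted sum of the reported values."""
    return sum(w * y for w, y in zip(weights, values))


# Hypothetical sample of four persons in two post-strata.
weights = [10.0, 10.0, 20.0, 20.0]       # selection weights
strata = ["a", "a", "b", "b"]
benchmarks = {"a": 30.0, "b": 30.0}      # known population counts
employed = [1, 0, 1, 1]

calibrated = poststratify(weights, strata, benchmarks)
print(calibrated)                            # [15.0, 15.0, 15.0, 15.0]
print(weighted_total(calibrated, employed))  # 45.0
```

The estimate of 45 equals the sum over post-strata of the benchmark times the weighted stratum mean (30 × 0.5 + 30 × 1), which is the sense in which the regression parameters are the post-stratum means.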


2.3 Rotation Group Estimates 


Each rotation group consists of a representative sample 
of dwellings, and so can provide a separate estimate. 
Number the rotation groups at a time point according to the 
number of times the dwellings in the rotation group have 
been sampled. Write $R(t, i) = r$ if unit $i$ is in the rotation
group sampled for the $r$-th time at time $t$. The Horvitz-Thompson
estimate of $Y_t$ based on rotation group $r$ is

    $\hat{y}_t^{Hr} = \sum_{i:\,R(t,i)=r} 8\, w_{ti}\, y_{ti}$.    (5)


Generalised regression can be used to improve these 
estimators, by calibrating the weights to add to a set of 
benchmarks. Unfortunately the lower sample size in a 
single rotation group may require using a smaller number of 
benchmarks than in the overall case. In the LFS situation I 
used a single generalised regression step on the whole 
sample so that across the whole sample the weights add to 
the benchmarks for the current 560 poststrata, while in each 
rotation group the weights add to an eighth of the bench- 
marks for 71 collapsed poststrata. The resulting weights, 
when applied to a given rotation group r and multiplied by 
eight, give the rotation group estimates $\hat{y}_t^{Gr}$.


2.4 Time-in-Survey Bias 


Ideally rotation group estimates should have the same 
expectation $Y_t$, but in practice they have slightly different
expectations, and hence different biases. Some of the diffe- 
rence is due to collection practices — for example, dwellings 
sampled for the first time are interviewed using a personal 
visit, while other rotation groups are mostly interviewed by 
telephone. It is not clear which rotation group is least 
affected by this sort of “time-in-survey” bias. The overall 
estimate will have a time-in-survey bias that is some mix of 
the biases from each rotation group. We rely on good 
survey methods to keep this bias small. Note that all the 
composite estimators will have different contributions from 
the rotation groups, and therefore different time-in-survey 
biases. 


3. AK COMPOSITE ESTIMATION 


3.1 AK Composite Estimator 


The AK composite estimator (Gurney and Daly 1965) is 
designed to put extra emphasis on the movement from the 
matched rotation groups (those rotation groups in which the 
same dwellings were selected in the current and previous 
months). The estimator has three components. The first is 
a mean of the rotation group estimates for the current month
data (time $t$). The second is last month’s AK composite plus
a movement estimate based only on the matched rotation 
groups. The third component is the difference between 
estimates from the unmatched rotation group and from the 
matched ones. How much of each component to use is 
given by two parameters A and K, as follows: 


; 1 oR 
eK a,


3.2 Choosing Parameter Values


The key parameter is K, which gives how much of the 
new estimate is based on the matched rotation group move- 
ment. The optimal A and K to use will depend on the 
variable being estimated. Higher K values are appropriate 
for employment than for unemployment, since employment 
has a higher correlation between months. 

AK composite estimates of persons employed, unem- 
ployed and “not in the labour force” will not add correctly 
to the total population unless the same parameters are used 
for all the estimates. This leads to using a compromise 
choice of A and K. The results in this report are based on 
A =0.06 and K = 0.7. These values were found by trying a 
range of values of A and K, and choosing those that gave 
optimal employed estimates. In this study no values of A 
and K gave unemployed estimates appreciably better than 
these values. 
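
To make the roles of A and K concrete, the sketch below combines the three components described in Section 3.1 in one commonly used way: weight (1 − K) on the mean of all current rotation group estimates, weight K on last month's composite plus the matched-group movement, and weight A on the difference between the unmatched group and the matched groups. The exact scaling of the third component in the original formula could not be recovered here, so this form is an assumption, and the function name and data are hypothetical.

```python
def ak_composite(current, previous, prev_ak, A=0.06, K=0.7, unmatched=0):
    """Sketch of an AK-type composite estimate.

    current, previous: rotation group estimates for this month and last month,
        indexed so that position r holds the same dwellings in both months
        for every matched group.
    unmatched: position of the group whose dwellings are new this month.
    """
    matched = [r for r in range(len(current)) if r != unmatched]
    mean_all = sum(current) / len(current)
    matched_now = sum(current[r] for r in matched) / len(matched)
    matched_prev = sum(previous[r] for r in matched) / len(matched)
    movement = matched_now - matched_prev            # matched-group movement
    difference = current[unmatched] - matched_now    # unmatched minus matched
    return (1 - K) * mean_all + K * (prev_ak + movement) + A * difference


# Hypothetical data: eight rotation groups, the first one newly rotated in.
previous = [100.0] * 8
current = [104.0] + [102.0] * 7
print(ak_composite(current, previous, prev_ak=100.0))
```

With these inputs the three components are 0.3 × 102.25, 0.7 × 102 and 0.06 × 2, giving roughly 102.195.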

Our empirical study did not show particularly good 
sampling errors for the AK estimator. The fine calibration 
that was used in obtaining the rotation group estimates may 
be to blame — it is possible that using broader categories 
would improve the sampling errors. 


3.3 Properties of the AK Estimator 


The AK estimator puts extra emphasis on the movement 
in the matched rotation groups. Thus the rotation group 
containing dwellings in sample for the first time contributes 
less than in the GR estimator. The AK estimator thus has a 
different time-in-survey bias to the GR estimator. 

The AK estimator is recursive, in that last month’s 
estimator is required in order to produce this month’s. This 
is inconvenient for producing estimates for a new item or 
category. Also, the need to use the same values of A and K 
for all items can give sub-optimal estimates for any given 
item. 

These concerns have led to the US Current Population 
Survey changing to a variant known as “AK composite 
weighting” (Lent, Miller and Cantwell 1994). In AK composite
weighting, separate employed and unemployed estimates
are produced for a number of important published
categories, using the AK composite with optimal parameters
for the estimate in question. The current data is then 
calibrated so that the unit weights add to these AK estimates 
as well as demographic benchmarks. All estimates are then 
produced from the current dataset using these new “AK 
composite weights”. 

The convenience of producing all estimates as a 
weighted sum of a single month’s data is a major advantage 
of the AK composite weighting approach. Another is that 
the most important estimates are AK composite estimates 
with a near-optimal choice of A and K. A disadvantage is that only
the most important estimates are true composite estimates. 
Any other estimates (including estimates of persons not in 
the labour force) are typically not much improved over the 
standard GR estimates (Lent, Miller and Cantwell 1996). 


4. MODIFIED REGRESSION ESTIMATION 


4.1 Overview of Modified Regression 


The modified regression method is another way to pro- 
vide composite estimates that can be obtained as weighted 
aggregates of the current survey dataset. The method targets 
a predetermined set of key items, for which it achieves 
particularly low sampling errors. 

The modified regression technique uses generalised
regression on the current month’s dataset after attaching
new auxiliary variables $z_{ti}$ to each unit $i$ at time $t$. Here $z_{ti}$
is a row vector with an element for each of the key items.
Corresponding to these we have “pseudo-benchmarks” $Z_t$
based on the previous month’s estimates for the key items.
The modified regression estimator is then given by a generalised
regression step applying both the demographic
benchmarks and the pseudo-benchmarks.


Somes HCL) AF, LB, (7)

The key to the method is the definition of the auxiliary
variables. Let $D$ be the set of units in the matched rotation
groups (those with dwellings selected at both time points)
at time $t$. Let $y_{ti}$ be the vector of key items for unit $i$ at time
$t$ and $Y_t$ the corresponding population totals. For $i \in D$, let
$y_{t-1,i}$ be the previous month’s value for the vector of key
items, or if no value was reported let $y_{t-1,i}$ be imputed; I
used $y_{t-1,i} = y_{ti}$ as suggested by Singh (1996).

I look at modified regression estimates for $z_{ti}$ of the following
form, for $\alpha \in [0, 1]$:


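    $z_{ti} = (1 - \alpha)\frac{8}{7}\, y_{t-1,i} + \alpha\left[y_{ti} - \frac{8}{7}(y_{ti} - y_{t-1,i})\right]$   for $i \in D$

    $z_{ti} = \alpha\, y_{ti}$   for $i \notin D$.    (10)


Given this definition we have

    $\hat{z}_t = (1 - \alpha)\,\hat{y}_{t-1}^{HD} + \alpha\left(\hat{y}_t^{H} - (\hat{y}_t^{HD} - \hat{y}_{t-1}^{HD})\right)$    (11)

where $\hat{y}_{t-1}^{HD} = \frac{8}{7}\sum_{i \in D} w_{ti}\, y_{t-1,i}$ and $\hat{y}_t^{HD} = \frac{8}{7}\sum_{i \in D} w_{ti}\, y_{ti}$
are estimates of $Y_{t-1}$ and $Y_t$ respectively based on
units in $D$ only and using this month’s selection weights.
For $\alpha = 0$, $\hat{z}_t$ is just the estimate $\hat{y}_{t-1}^{HD}$. For $\alpha = 1$, $\hat{z}_t$ is
this month’s Horvitz-Thompson estimate minus an estimate
of movement based on the matched rotation groups
$(\hat{y}_t^{HD} - \hat{y}_{t-1}^{HD})$. Values $\alpha = 0$ and $\alpha = 1$ give the methods
MR1 and MR2 respectively of Singh et al. (1997). Use of
an intermediate $\alpha$ was suggested by Fuller (1999).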
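
The unit-level variable can be sketched as follows. The form used is an assumption: it is the one consistent with the description above, in which the weighted total of $z$ is last month's matched-group estimate for $\alpha = 0$, and this month's Horvitz-Thompson estimate minus the matched-group movement for $\alpha = 1$, with the factor 8/7 scaling the seven matched rotation groups up to the full sample. Names and data are hypothetical.

```python
SCALE = 8 / 7  # scales seven matched rotation groups up to the full sample


def mr_auxiliary(y_now, y_prev, in_matched, alpha):
    """Auxiliary variable z for one unit (assumed form, see the lead-in)."""
    if not in_matched:
        return alpha * y_now
    return (1 - alpha) * SCALE * y_prev + alpha * (y_now - SCALE * (y_now - y_prev))


# Hypothetical units: (selection weight, y this month, y last month, matched?)
units = [(8.0, 1, 1, True), (8.0, 0, 1, True), (8.0, 1, None, False)]
alpha = 0.7

# Weighted total of z over the whole sample.
z_hat = sum(w * mr_auxiliary(y, y_prev, m, alpha) for w, y, y_prev, m in units)

# The same quantity assembled from the aggregate estimates.
ht_now = sum(w * y for w, y, _, _ in units)
hd_now = SCALE * sum(w * y for w, y, _, m in units if m)
hd_prev = SCALE * sum(w * y_prev for w, _, y_prev, m in units if m)
z_check = (1 - alpha) * hd_prev + alpha * (ht_now - (hd_now - hd_prev))

print(z_hat, z_check)
```

The two printed values agree up to floating-point rounding, which confirms that the unit-level definition aggregates to the stated combination of estimates.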

An appropriate pseudo-benchmark $Z_t$ would be an
estimate of $Y_{t-1}$, adjusted to agree with this month’s
weights. Following Singh et al. (1997) I used a step of
generalised regression to adjust last month’s modified
regression estimator to add to this month’s benchmarks:


Lec) ba (12)

Note that Z, = en since oM =X) =X) This
completes the definition of the modified regression estimators.


4.2 Properties of Modified Regression Estimators


The movement $\hat{y}_t^{HD} - \hat{y}_{t-1}^{HD}$ in (11) is actually based on
the matched sample only (i.e., units reporting at both times),
since other units in the matched rotation groups $D$ contribute
zero to the movement (for the imputation used here).
This may lead to the modified regression estimators having
a lower sampling error than an AK estimator, as this
“matched sample movement” is not affected by units not
present in both months.

Unfortunately, this also gives the possibility of a bias if 
persons not represented in the matched sample have diffe- 
rent behaviour to those in the matched sample. This may 
well be the case — the matched sample excludes persons that 
changed dwelling between the two months, and it is 
possible that changes of dwelling are related to changes of 
employment. This “matched sample bias” will be in addi- 
tion to any time-in-survey bias. 

Another problem arises with the MR2 estimator (i.e.,
$\alpha = 1$). If the $k$-th key variable $y_{ti,k}$ has high month-to-month
correlation then it will also have a high correlation
with the $k$-th new auxiliary variable $z_{ti,k}$. For such a
variable the element of $\hat{\beta}$ corresponding to $z_{ti,k}$ will take
some value $\gamma_k$ close to one. Using (7), (11), and $Z_t = \hat{y}_{t-1}^{M}$,
the MR2 estimator takes the form

    $\hat{y}_{t,k}^{M} = (1 - \gamma_k)\,\hat{y}_{t,k}^{H} + \gamma_k\left(\hat{y}_{t-1,k}^{M} + \hat{y}_{t,k}^{HD} - \hat{y}_{t-1,k}^{HD}\right)$
                  + other terms.    (14)


In this case it is possible that the matched sample 
movement at a given time will have a strong influence on 
estimates for many time points thereafter. In addition, any 
small bias in the movement will tend to accumulate over 
time. This danger was recognised by Fuller (1999), and 
referred to as “the drift problem”. This was a motivation for
his suggestion of the form of estimator given here, with a
value of $\alpha$ less than 1.

In summary, modified regression has similar advantages 
to the AK composite weighting approach, but with possibly 
lower sampling error. The method is not difficult to apply, 
and avoids the need to separately calibrate the rotation 
groups to the benchmarks. 


5. BEST LINEAR UNBIASED ESTIMATION 
(BLUE) 


5.1 Fixed Window BLUE 


The fixed window BLUE estimator (denoted by $\hat{y}_t^{B}$) is
obtained by choosing an “optimal” linear combination of
the rotation group estimates $\hat{y}_s^{Gr}$ (as defined in 2.3) from a
window of $l + 1$ months, as follows:

    $\hat{y}_t^{B} = \sum_{s=t-l}^{t} \sum_{r=1}^{8} a_{s,r}\, \hat{y}_s^{Gr}$    (15)

where the parameters $a_{s,r}$ are chosen to minimise $\mathrm{var}(\hat{y}_t^{B})$
under the constraints $\sum_r a_{s,r} = 1$ for $s = t$ and $\sum_r a_{s,r} = 0$
for $s = t-l, \ldots, t-1$. These constraints ensure that $\hat{y}_t^{B}$ will
be unbiased for $Y_t$ provided that the rotation group estimates
are unbiased, i.e., $E(\hat{y}_s^{Gr}) = Y_s$ for $s = t-l, \ldots, t$.

The minimisation requires knowing the variances and
covariances of the rotation group estimates. In practice
these are estimated based on historical data. The problem
can then be written in a matrix form: we aim to choose the
column vector $a$ (with elements $a_{s,r}$ for $s = t-l, \ldots, t$ and
$r = 1, \ldots, 8$) so as to minimise a quadratic form $a'Va$
subject to constraints $C'a = c$. The relevant standard result
(Rao 1973 page 65) is that the minimum occurs for
$a = V^{-1}Cq$ where $q$ is a solution of $(C'V^{-1}C)q = c$. In
this study the matrix $V$ was replaced by a correlation matrix,
under the assumption that all the rotation group estimates in
the window had the same variance.
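
The constrained minimisation can be sketched on a deliberately small, hypothetical problem: two estimates for the current month and two for the previous month, with only the first and third estimates correlated (a rotation group matched across the two months). This is far smaller than the eight rotation groups and multi-month window used in the study, and the linear solver is a plain Gaussian elimination written out so that the example needs no external library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def blue_weights(V, C, c):
    """Minimise a'Va subject to C'a = c, using a = V^{-1} C q with (C'V^{-1}C) q = c."""
    n, k = len(C), len(C[0])
    # Column j of V^{-1} C, obtained by solving V x = (column j of C).
    cols = [solve(V, [C[i][j] for i in range(n)]) for j in range(k)]
    # The k x k matrix C' V^{-1} C.
    G = [[sum(C[i][r] * cols[j][i] for i in range(n)) for j in range(k)]
         for r in range(k)]
    q = solve(G, c)
    return [sum(cols[j][i] * q[j] for j in range(k)) for i in range(n)]


rho = 0.8  # hypothetical correlation for the matched group across months
# Order of estimates: current matched, current unmatched,
#                     previous matched, previous rotated-out.
V = [[1, 0, rho, 0],
     [0, 1, 0, 0],
     [rho, 0, 1, 0],
     [0, 0, 0, 1]]
C = [[1, 0], [1, 0], [0, 1], [0, 1]]  # current weights sum to 1, previous to 0
c = [1, 0]

a = blue_weights(V, C, c)
variance = sum(a[i] * V[i][j] * a[j] for i in range(4) for j in range(4))
print([round(w, 4) for w in a], round(variance, 4))
```

In this toy case the matched current estimate receives weight $2/(4 - \rho^2)$, the previous-month weights are equal and opposite, and the variance falls below the value 0.5 obtained by simply averaging the two current estimates.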


5.2 Correlation Structure of Rotation Group 
Estimates 


Since different correlation patterns give different BLUE 
estimates, choosing a correlation pattern has similar issues
associated with it as choosing parameters A and K in the
AK composite. It is desirable to use the same linear combi- 
nation for all estimates to assure additivity of the estimates. 

I assumed a four parameter model for the correlation
pattern:

    $\mathrm{corr}(\hat{y}_t^{Gr}, \hat{y}_s^{Gr'}) = \rho^{A}_{|t-s|}$   for $r - r' = t - s$

                                  $= \rho^{B}_{|t-s|}$   for $r - r' = t - s + 8m$, for integer $m \neq 0$

                                  $= 0$   otherwise.    (16)


Thus the correlation between estimates at lag $k$ from the
same rotation group is $\rho^{A}_k$ if the rotation group contains the
same dwellings at the two times, and $\rho^{B}_k$ otherwise.
Estimates from different rotation groups are uncorrelated.
A four parameter model is used:


pe =(1-1ry)O@prp + O5(1 - rp)) and (17) 


pp =(1-7rz) O4(1 - rp). (18) 


Bell and Carolan (1998) discuss this model. The
parameter values used in this paper were 0, = 0.87697, 
07- 0.94.7 75 0.3101 and r, = 0.90456. These values 
result from fitting the model to estimated autocorrelations 
for rotation group estimates of proportion employed. 

It is important to note that the BLUE estimates are 
unbiased regardless of the correctness of the assumed 
correlation model. The model used here aims to be optimal 
for estimates of employed persons, but turns out to perform 
well for unemployed persons as well. Trying other values 
for the model parameters did not give any marked 
improvement in standard errors for unemployed persons. 


5.3. Improved BLUE Estimates 


A problem with the BLUE estimates above is that GR 
estimates are required at rotation group level. The lower 
sample size at rotation group level may limit the bench- 
marks that can be used, as discussed for the AK. For the 
BLUE, however, an alternative approach is available. 

The B1 estimator is defined by forming a BLUE estima- 
tor based on the Horvitz-Thompson estimators at rotation 
group level, and then applying the generalised regression 
technique to improve this estimator. This proceeds as 
follows. Define yp = Gr, i)¥4 and ay = Apu jy Xo Where 
A,pq,i) 18 the BLUE multiplier applicable to the rotation 
group unit 7 is in at time t. Then the BLUE estimator based 
on the Horvitz-Thompson estimators can be written 

t 
ae OS EK (19) 
S=t— 1) 5 

Calibrating to the benchmarks we get the improved 
BLUE estimator B1: 

Pata at, )B (20)


5.4 Properties of the BLUE and B1 Estimators


The BLUE and B1 estimates are sums of weighted unit 
data from a window of months. Each estimate needs only 
data from this window, and can be produced independently 
from the estimates for previous months — so the method is 
not recursive. 

The same month of data will contribute with different 
weights to the estimate for different times. A unit will 
contribute a sizeable weight to its current month estimate, 
and a weight near zero, often negative, to estimates for 
other months. The work required in producing a table is the 
same as for GR multiplied by the size of the window. 
There is also a possibility of negative estimates for tiny cells 
containing no current units. 

Note that in the B1 estimator the weights applied to
months other than the current one are not forced to sum to
zero. Under the model assumptions the B1 estimate
remains unconditionally unbiased, since the BLUE estimates
based on the Horvitz-Thompson estimators are
unbiased for $Y_t$ and $X_t$ respectively. In practice the current
month contributes around 99.5 percent of the total weight.
I consider the resulting bias to be small and not dangerous
(leading as it does to some slight smoothing of the estimates
over time).

For any estimate in which data from month to month is 
appreciably correlated, the BLUE and B1 estimates should 
have lower sampling error than the GR estimate. This is a 
theoretical advantage over a method that is designed for 
improving a predetermined set of estimates (like modified 
regression or the AK with composite weighting). In practice 
this advantage may not be too important, as for the LFS 
much of our interest is in a small number of well-defined 
estimates. 

The user must also determine the time period or 
“window” from which estimates will be used. Using too 
many time points will be expensive computationally, while 
too few will limit the gains available. The seven-month 
window used here was sufficient to obtain nearly all the 
available gains, while smaller windows gave noticeably 
higher standard errors. 


6. COMPARING THE METHODS 


6.1 Method of Comparison 


Estimates for July 1993 to January 1999 were produced 
based on data from January 1993 to January 1999. Esti- 
mates were obtained classified by month, state, sex, marital 
status and employment status. Estimates were also obtained 
for lag one movement, quarterly average and movement 
between successive quarterly averages. 

Standard errors for these estimates were calculated using 
the “delete-a-group jackknife” technique (Kott 1998). The 
geographic units that form the first stage of sample selec- 
tion were divided systematically into G = 30 groups, and 
“replicate groups” were formed consisting of the whole 
sample excluding the units from one of these groups. Each 
estimate studied was also produced for each of the G replicate 
groups. Writing $e$ for the estimate and $e_{(g)}$ for the estimate 
from replicate $g$, the delete-a-group jackknife estimate of 
standard error is given by 

$$\mathrm{SE}(e) = \sqrt{\frac{G-1}{G}\sum_{g=1}^{G}\left(e_{(g)} - e\right)^{2}}. \qquad (24)$$

Estimates and standard errors were obtained for each of 
the following estimators (listed with short mnemonics for 
later reference): 


GR: Generalised regression estimate as currently used in the LFS 

AK: AK estimate with K = 0.7, A = 0.06 

BL: BLUE based on 7 month window 

B1: Improved BLUE based on 7 month window 

MR2: MR2 estimator (modified regression with a = 1) 

MF: Fuller’s variant of modified regression (a = 0.7) 

The modified regression estimators require a choice of 
the key variables to be optimised for. In producing the 
modified regression estimates in this report, z variables 
were produced for estimates of employed and unemployed 
for each state and sex. This gives a total of 32 extra auxiliary 
variables, in addition to the usual 560 post-stratum 
benchmarks used in generalised regression. 
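
The delete-a-group jackknife described in this section can be sketched in a few lines. This is a minimal illustration rather than the ABS production code: it assumes the usual Kott (1998) form, in which squared deviations of the G replicate estimates from the full-sample estimate are summed and scaled by (G - 1)/G, and the function name is hypothetical.

```python
import math


def delete_a_group_jackknife_se(full_estimate, replicate_estimates):
    """Delete-a-group jackknife standard error.

    full_estimate       : the estimate e from the whole sample
    replicate_estimates : the G estimates e_(g), each computed from the
                          sample with one group of first-stage units removed
    """
    G = len(replicate_estimates)
    if G < 2:
        raise ValueError("need at least two replicate groups")
    # Sum of squared deviations of replicate estimates from the full estimate
    sum_sq = sum((e_g - full_estimate) ** 2 for e_g in replicate_estimates)
    # Scale by (G - 1) / G and take the square root (assumed Kott 1998 form)
    return math.sqrt((G - 1) / G * sum_sq)


# Toy illustration with G = 2 replicate estimates (made-up numbers)
se = delete_a_group_jackknife_se(10.0, [9.0, 11.0])
```

In the study G = 30, and every estimator (GR, AK, BL, B1, MR2, MF) would be recomputed on each replicate before this formula is applied.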


6.2 Differences From GR Estimate 


The current GR estimator can be used as a basis of 
comparison for the other estimators. Rather than present 
graphs of level estimates, I present the differences of the 
alternative estimates from the current GR estimates. 
Graphs 1 and 2 show these differences for estimates of 
employed persons and unemployed persons respectively. 
To put the size of these differences in perspective, note that 
the published standard errors for the current estimate were 
25,200 for employed persons and 7,900 for unemployed 
persons in January 1999 (and similar for other months). 

The AK, BL and B1 estimates are quite similar, since in 
all three methods the contribution of a unit depends on its 
rotation group. In both graphs the AK, BL and B1 estima- 
tors appear to give lower values on average than the GR 
estimates. This indicates a change in the time-in-survey 
bias, resulting from putting less weight on the rotation 
group being sampled for the first time. The estimates vary 
up and down from their average difference for short 
periods. 

The MR2 and MF estimates tend to be different to the 
other estimates since they emphasise the contribution of 
units from the matched sample. For employed persons, the 
MR2 and MF estimators are considerably larger on average 
than the GR estimates, up until September 1997. There is 
then a drop in the differences corresponding to the phase-in 
of a new sample from September 1997. For reasons that are 
not clear, over this period the matched sample behaved 
differently to the overall sample. This affects the difference 
between these modified regression series and the GR series. 
What may be of some concern is that the level change 
influences the level of the MR2 series for a considerable 
period thereafter, possibly a manifestation of the so-called 
“drift problem”. 

Graph 1. Difference of alternative estimates from GR estimate, employed persons (’000s), July 1993 to January 1999 

For unemployed persons the MR2 and MF estimates tend 
to be lower than the GR estimates. There is no evidence of 
a “drift problem” for unemployed persons, which is not 
surprising given the lower correlations involved. 


6.3 Average Differences by Calendar Month 


To quantify the likely change in bias from moving to a 
new estimator, the average difference over the period of 
each estimate from the GR estimate was calculated. It is 
possible that this difference is seasonal, so averages were 
obtained separately for each month of the calendar year, as 
well as overall. Average differences over the period July 
1993 to January 1999 are given for employed persons in 
Graph 3. 

The graph shows that estimates of employed persons 
would have been higher on average using the MR2 or MF 
estimator. This upward difference for the modified 
regression estimators may actually be a feature of the 
particular period, since the difference has apparently 
dissipated since September 1997. 

The other feature of the MR2 and MF estimates is that 
the difference for employed is highly seasonal. For 
example, the movement from December to January of the 
MR2 estimates is about 40,000 higher than the movement 
in the GR estimates. This suggests that the matched sample 
tends to miss people who were employed in December but 
not in January. The same seasonality shows up in looking 
at estimates from the matched sample directly. The matched 
rotation group movement does not show this large seasonal 
bias. 

For the AK, BL and B1 estimates there is some seasonal- 
ity in the differences, but the differences are much smaller. 

Graph 4 shows the average differences of the various 
estimates from the GR estimate for unemployed persons 
over the same period. Here there appears to be a negative 
difference for all the estimators, though less pronounced for 
the AK, BL and B1 estimates than for the MR2 and MF. 
The change in seasonality from changing from the GR to 
the MR2 and MF estimators is again more extreme than for 
moving to the other estimators. 

Graph 3. Average difference from GR estimate, overall and by calendar month, employed (’000) 


6.4 Standard Errors 


Standard errors (SEs) of estimates overall, by marital 
status and by sex are presented in the following graphs. 
Each SE estimate is expressed as a percentage of the SE 
estimate for the same quantity under the GR method (i.e., 
the current LFS SE), and these percentages are then 
averaged over the period for which they were produced 
(July 1993 to January 1999 for level estimates). Graphs 5, 6, 
7 and 8 show SEs of both employed and unemployed 
persons for level, movement, quarterly average and 
movement of quarterly average respectively. 

For all these estimates the BLUE-class estimator B1 has 
lower sampling error than the AK or BL estimators. Given 
that the B1 estimate appears to have similar bias and 
seasonality of bias, it appears that the AK and BL estimators 
used in this study are not competitive with the B1 estimator. 

The modified regression estimators MR2 and MF, on the 
other hand, give much lower sampling errors than the B1 
estimator for employed persons for overall estimates and 
estimates by sex. These are key estimates used in the 
modified regression — other key estimates such as state 
estimates also gave similarly improved standard errors. 
Estimates by marital status are not key estimates, and these 
have higher standard errors for MR2 and MF than for the 
B1 estimator. 

For unemployed persons the improvements in SEs from 
using MR2 and MF are less consistent, disappearing altogether 
for estimates of quarterly average. The B1 estimator is 
more consistent in lowering the standard error, although the 
gains available for unemployed are lower than for 
employed. 

Graph 8. Standard error of movement of quarterly average (% of current SE) 


6.5 Seasonally Adjusted and Trend Series 


The ABS uses the X11 package (Shiskin, Young and 
Musgrave 1967) to produce seasonally adjusted estimates 
that aim to remove various calendar effects from the series. 
The package also produces a trend, which is an indicator of 
the underlying behaviour of the series. 

The trend value for a time point is revised as data for 
later times becomes available. I estimated the standard error 
of trend estimates at the end of the series (end trend) and for 
the same points when twelve further months of data are 
available (mid trend). Revisions of the trend (or trend 
movement) were defined as the difference between the mid 
and end values of the trend (or trend movement). The size 
of the revision depends on the shape of the true series and 
on the sampling error in the estimated series. The mean 
squared trend revision for a series of unbiased estimates is 
the sum of two components: the mean squared trend 
revision that would have occurred even with no sampling 
error, and the variance of the estimate of revision. Thus the 
standard error of the revision is a measure of the sampling 
error component of the mean squared trend revision (see 
Bell 1999). 
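
The decomposition can be written out explicitly. As a sketch, let $\hat{r}_t$ denote the estimated trend revision at time $t$ and $r_t$ the revision that would occur with no sampling error. Both symbols are introduced here for illustration, and the trend filter is treated as linear, so that unbiased estimates give $E(\hat{r}_t) = r_t$:

```latex
E\left(\hat{r}_t^{\,2}\right)
  = \left[E(\hat{r}_t)\right]^{2} + \operatorname{Var}(\hat{r}_t)
  = r_t^{2} + \operatorname{Var}(\hat{r}_t)
```

Averaging over $t$ gives the two components named in the text; the jackknife standard error of the revision estimates the square root of the second term.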

Seasonally adjusted figures are similarly subject to 
revisions. I present standard errors for level and movement 
of seasonally adjusted estimates at the end of the series. 
Standard errors for later revisions of these estimates were 
very similar. 

The delete-a-group jackknife technique was used to 
produce estimates of standard error for the various trend 
and seasonally adjusted estimates. This technique requires 
producing replicate versions of the estimates. Unfortunately, 
the study provided replicate values for the time 
series only for time points from July 1993 to January 1999. 
Each of these replicate time series was supplemented by 
the previous 9 years of historical data so as to have sufficient 
data to apply the X11 package. Because the replicate 
seasonally adjusted and trend series are based on the same 
values before July 1993 the jackknife estimate of SE will 
tend to underestimate the true SE slightly, especially for 
times early in the series. To minimise this effect the mea- 
sures of change in sampling error were averaged over 
months from January 1995 on only (and only up to January 
1998, so that the 12 months to January 1999 can be used 
for estimating revisions). 


Table 1 
Standard error as percentage of standard error of 
current GR estimator 


                                      AK            BL            B1            MR2           MF
Employed persons: 
  level                               93            92            89            82            83
  movement                            95            95            89            66            69
  quarterly average                   93            92            89            85            85
  movement of quarterly average       84            82            80            63            64
  seasonally adjusted                 94            92            90            87            88
  movement of seasonally adjusted     96            95            [illegible]   68            71
  trend at end                        93            91            89            88            88
  movement of trend at end            86            84            82            65            67
  revision of trend                   88            85            83            66            68
  revision of movement of trend       89            86            84            67            69
Unemployed persons: 
  level                               100           99            95            96            94
  movement                            101           101           95            87            86
  quarterly average                   100           99            95            100           98
  movement of quarterly average       97            95            91            92            90
  seasonally adjusted                 100           99            95            96            95
  movement of seasonally adjusted     102           102           95            87            86
  trend at end                        99            98            95            99            97
  movement of trend at end            97            95            92            93            91
  revision of trend                   [illegible]   95            91            [illegible]   89
  revision of movement of trend       97            95            [illegible]   92            90


Table 1 gives these average standard errors for various 
seasonally adjusted and trend measures, relative to those 
available from the current GR estimator, for both employed 
and unemployed persons. Also in the table are cor- 
responding figures for level, movement, quarterly average 
and movement of quarterly average, as presented in Graphs 
5 to 8. 

I would argue that for many purposes the most important 
indicators are those that give the underlying direction of the 
series at the current end, i.e., movement of quarterly 
average, and movement of trend. A reduced standard error 
for these items makes the underlying direction of the series 
at the end clearer, even for users who rely on visual 
inspection or on some smoothing process other than the 
X11 trend. This in turn improves the ability to detect 
turning points in the underlying series. 

For movement of trend the B1 estimator achieves an 
18% reduction in standard error for employed persons and 
an 8% reduction for unemployed persons. For the MR2 
these reductions are 35% and 7% respectively. The compo- 
site estimators also reduce the contribution of sampling 
error to revisions in the trend series. 
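
The reductions quoted above follow directly from the "movement of trend at end" rows of Table 1, since each entry is a standard error expressed as a percentage of the GR standard error. A short check, with a hypothetical helper name and the four table values typed in by hand:

```python
def percent_reduction(relative_se):
    """Percentage reduction in SE implied by an SE given as % of the GR SE."""
    return 100 - relative_se


# "movement of trend at end" entries of Table 1 (B1 and MR2 columns)
movement_of_trend = {
    "employed": {"B1": 82, "MR2": 65},
    "unemployed": {"B1": 92, "MR2": 93},
}

reductions = {
    status: {est: percent_reduction(value) for est, value in row.items()}
    for status, row in movement_of_trend.items()
}
```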


6.6 Summary 


This paper presents a variant of the BLUE estimator, the 
B1 estimator, which applies the generalised regression tech- 
nique to a composite estimate based on a window of seven 
months of data. On Australian data, the B1 has lower 
sampling error than the traditional BLUE or AK estimators 
for a variety of measures including seasonally adjusted and 
trend estimates. The paper also evaluated a “modified 
regression” composite estimator MR2 proposed by A.C. 
Singh and a variant of this proposed by W. Fuller. These 
estimators gave considerably lower sampling errors than the 
B1 estimator for a number of measures, especially those 
based on employed persons. 

The evaluation of a composite estimator will depend on 
many factors other than the sampling errors. The B1 esti- 
mator has the disadvantage that tabulations require 
weighted aggregation of seven months of data, whereas the 
modified regression estimators provide weights for a single 
month’s data. On the other hand, the modified regression 
estimators may be biased if persons reporting in two 
successive months (the matched sample) are not represen- 
tative of other persons (such as people moving house). 
Introducing the modified regression estimators would also 
induce a larger change in estimate and in seasonality than 
introducing the B1 estimator. 


Regression Composite Estimation for the Canadian Labour Force 
Survey: Evaluation and Implementation 


JACK GAMBINO, BRIAN KENNEDY and MANGALA P. SINGH 


ABSTRACT 


The Canadian Labour Force Survey (LFS) is a monthly survey with a complex rotating panel design. After extensive studies, 
including the investigation of a number of alternative methods for exploiting the sample overlap to improve the quality of 
estimates, the LFS has chosen a composite estimation method which achieves this goal while satisfying practical constraints. 
In addition, for variables where there is a substantial gain in efficiency, the new time series tend to make more sense from 
a subject-matter perspective. This makes it easier to explain LFS estimates to users and the media. Because of the reduced 
variance under composite estimation, for some variables it is now possible to publish monthly estimates where only 
three-month moving averages were published in the past. In addition, a greater number of series can be successfully 
seasonally adjusted. 


KEY WORDS: Rotating panel survey; Estimation system; Weighting; Change estimate; Level estimate. 


1. INTRODUCTION 


1.1 Why Composite Estimation? 


The Canadian Labour Force Survey (LFS) is a monthly 
survey of 54,000 households selected using a stratified 
multistage design. Households stay in the sample for six 
consecutive months, thus five-sixths of the sample is 
common between consecutive months. Each month, the 
members of a selected household are asked questions about 
their labour force status, earnings, and so on. In the LFS 
estimation system used prior to 2000, initial design weights 
were modified using regression to produce final weights 
that respect age-sex and geographical (subprovincial 
region) population control totals. Each record then had a 
unique final weight that was used for all tabulations. 

The estimation system used data from the current month 
only. No attempt was made to exploit the fact that the 
common sample can be used to improve estimates. 
However, characteristics such as employment by industry 
are highly correlated over time and unemployment is 
moderately correlated over time, thus there is potential for 
efficiency gains. Because of these gains, surveys similar to 
the LFS, such as the United States Current Population 
Survey (CPS), have used composite estimation to improve 
their estimates for many years. However, the LFS did not 
introduce composite estimation until January 2000. 

In the early 1980s (see Kumar and Lee 1983), the CPS 
approach to composite estimation was studied for possible 
implementation in the LFS. Although the results showed 
that there were efficiency gains for Employed and, to a 
lesser extent, for Unemployed, it was felt that these gains 
were outweighed by the negative aspects of the method. 
These include the fact that the optimal parameters for 
Employed and Unemployed are quite different, which 
would have forced a trade-off between, on the one hand, 
using a compromise set of parameters, thereby diluting the 
efficiency gains, and, on the other hand, having variables 
that do not add up to totals (e.g., Employed plus 
Unemployed would not equal Labour Force, unless one of 
the three is obtained as a residual). Another factor that 
worked against this form of composite estimation was that 
it was not compatible with the existing weighting, 
estimation and dissemination systems used by the LFS — the 
introduction of composite estimation would have required 
a complete overhaul of these systems. 

Traditionally, the key estimates produced by the Labour 
Force Survey were monthly unemployment rates. However, 
with the increasing emphasis on estimates of employment 
level and on estimates of change in recent years, the need to 
find ways to make use of the common sample also 
increased since these estimates would benefit significantly. 
In the mid-1990s, therefore, interest in composite estima- 
tion was revived at Statistics Canada, and a regression- 
based method that fit in well with the existing LFS 
estimation system was developed. This method is described 
in Singh, Kennedy, Wu and Brisebois (1997) with a more 
up to date version included in Singh, Kennedy and Wu 
(2001). The new methodology allows for a choice of 
methods, depending on one’s objectives. If the primary 
interest is in estimates of level, then one can use level- 
driven predictors in the procedure. If change is most 
important, then change-driven predictors can be used. One 
can go one step further and include both types of predictor 
in the procedure. However, in the latter case, the number of 
independent variables in the regression becomes large, 
which can lead to distortion of the final sample weights. 

Preliminary results based on the new method using 
change-driven predictors and others using level-driven 
predictors were discussed at meetings of Statistics Canada’s 
Advisory Committee on Statistical Methods. The method 
addressed the problems with traditional composite estimators 
and showed substantial gains in efficiency. It was 
noted, however, that the estimator using change-driven 
predictors may lead to a drift in level estimates over time in 
some extreme situations. Also, it was decided, based on the 
committee’s recommendation, that both estimates of level 
and of change should be given importance in the choice of 
predictors. After the exchange of technical notes between 
Wayne Fuller, J.N.K. Rao and Statistics Canada staff, a 
method suggested by Fuller, which combines the change-driven 
and level-driven approaches without the constraints 
associated with including both sets of predictors in the 
regression, was adopted (see Fuller and Rao 2001). The 
solution is remarkably straightforward: take a linear combination 
of the level and change predictors, 
$X = (1-\alpha)X_L + \alpha X_C$, and use it as the predictor. The change- and level- 
driven predictors are now special cases. Furthermore, one 
can choose α to reflect the relative importance one wishes 
to give to level versus change. 

The present paper describes the new composite estimator 
in section 2. An extensive evaluation of this estimator was 
carried out using actual LFS data for a number of 
characteristics over a long period of time. The results of 
these studies are summarized in section 3. Unlike traditional 
composite estimators, the regression based composite esti- 
mator requires that the matching of the sample between two 
consecutive months be done at the individual record level. 
This creates some interesting situations where one has to 
deal with nonrespondents and in-scope and out-of-scope 
individuals between two consecutive months in such a way 
that the quality of estimates of change is not compromised. 
Section 4 discusses the imputation procedure developed to 
deal with various situations that arise when dealing with 
incomplete data for two consecutive months. Finally, the 
success of this new composite estimator is judged not only 
on its statistical efficiency but also on its stability over time and its 
cost effectiveness, while achieving the following objectives: 
(i) minimizing changes to the old estimation system, (ii) 
producing a unique weight for each sample unit, (iii) 
respecting age-sex and geography control totals and (iv) 
producing consistent estimates (in the sense that, e.g., 
Employed + Unemployed = Labour Force and Labour Force 
+ Not In Labour Force = Population 15+). These objectives 
are discussed at various points in the paper, but especially 
in section 3. 


2. THE REGRESSION COMPOSITE 
ESTIMATOR 


Surveys such as the United States Current Population 
Survey have exploited their sample overlap by using K- 
composite or AK-composite estimators. Initially, the CPS 
used the K-composite estimator 


$$y'_t = (1-K)\,y_t + K\left(y'_{t-1} + \mathrm{change}_{t-1,t}\right)$$


with K = 1/2 for time t, where $\mathrm{change}_{t-1,t}$ denotes an 
estimate of change based on the common, or matched, 
sample. This was later replaced by the AK-composite 
estimator 


$$y'_t = (1-K)\,y_t + K\left(y'_{t-1} + \mathrm{change}_{t-1,t}\right) + A\,(\mathrm{unmatched} - \mathrm{matched})$$


with A = 0.2 and K = 0.4 (see Cantwell and Ernst 1992). 
The optimal values of A and K depend on the variable of 
interest, and using different values for different variables 
poses problems of consistency (in the sense that parts do 
not add up to totals) in this approach. This prompted us to 
look for alternative approaches that satisfy the objectives 
mentioned at the end of the previous section. 
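
The recursive structure of the K- and AK-composite estimators can be illustrated with a short sketch. The function name and argument layout are hypothetical, and starting the recursion at the first direct estimate is an assumption made here for illustration; it is not a description of the CPS system. Setting A = 0 gives the K-composite estimator.

```python
def ak_composite(direct, change, unmatched_minus_matched=None, K=0.4, A=0.2):
    """AK-composite estimates for a sequence of months.

    direct[t]  : direct estimate y_t for month t
    change[t]  : matched-sample estimate of change from t-1 to t
                 (change[0] is not used)
    unmatched_minus_matched[t] : difference between the estimates based on
                 the unmatched and matched parts of the sample at month t
    """
    n = len(direct)
    if unmatched_minus_matched is None:
        unmatched_minus_matched = [0.0] * n
    # Assumption: the recursion starts at the first direct estimate
    composite = [direct[0]]
    for t in range(1, n):
        value = ((1 - K) * direct[t]
                 + K * (composite[t - 1] + change[t])
                 + A * unmatched_minus_matched[t])
        composite.append(value)
    return composite


# K-composite (A = 0) with K = 1/2 on made-up numbers
k_series = ak_composite([100.0, 110.0], [0.0, 6.0], K=0.5, A=0.0)
```

Because each month's value depends on the previous composite value, the whole series must be produced in sequence, and different K and A for different variables yield series whose parts need not add to totals.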

It should be noted that we describe the new approach 
here at the person level, but in practice, person-level infor- 
mation is aggregated to the household level, and household- 
level records are then used by the estimation system. 

To use regression for weighting in the old LFS estima- 
tion system, a regression matrix X is formed. Each person 
in the sample corresponds to a row of X. Each column of X 
corresponds to a control total; e.g., column c may be Male 
20-24, and the value in row i, column c will equal 1 if 
person i is a male between the ages of 20 and 24, and 0 
otherwise (similarly for columns corresponding to geogra- 
phical areas). For further details on the estimation methods 
used by the Labour Force Survey, see Gambino, Singh, 
Dufour, Kennedy and Lindeyer (1998). 

To exploit the sample that is common between months, 
the X matrix is augmented by columns whose elements are 
defined in such a way that when this month’s final weights 
are applied to the elements of each new column, the total is 
a composite estimate from the previous month, i.e., last 
month’s composite estimate is used as a control total 
(strictly speaking, the control total is based on weights that 
reflect the current month’s population). As we noted in the 
introduction, there are several ways to define the new 
columns, depending on one’s objectives. We present below 
only the alternatives that were proposed for implementation. 

A typical new column will correspond to employment in 
some industry, say agriculture. If one is primarily interested 
in estimates of level, the following way of forming columns 
produces good results. Let M and U denote the matched 
(common) and unmatched (birth) sample, respectively. For 
person i, and times t-1 and t, let $y_{i,t-1}$ and $y_{i,t}$ be indicator 
variables which equal 1 if the person was employed 
in agriculture at the corresponding time, and 0 otherwise. Then let 


$$x_i^{(L)} = \begin{cases} \bar{y}'_{t-1} & \text{if } i \in U \\ y_{i,t-1} & \text{if } i \in M, \end{cases}$$


where $\bar{y}'_{t-1}$ is last month’s composite estimate of the 
proportion of people employed in agriculture; in practice, 
we use $\bar{y}'_{t-1} = \hat{Y}'_{t-1} / P_{t-1}$, where $P_{t-1}$ denotes the population 
aged 15 and over. The corresponding control total is last 
month’s estimate of the number of people employed in 
agriculture, i.e., $\hat{Y}'_{t-1}$. Thus the end result is that the final 
weighted sum of the elements of the new column will equal 
last month’s estimate. This is almost the same as forcing 
this month’s weights, applied to last month’s values for the 
common sample, to reproduce last month’s estimate of 
employment in agriculture (after adjusting by 5/6). We have 
used the superscript L as a reminder that the goal here is to 
improve estimates of level. 

If interest lies primarily in estimates of change, the 
following way of forming new columns of X produces good 
results: 


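$$x_i^{(C)} = \begin{cases} y_{i,t} & \text{if } i \in U \\ y_{i,t} + R\,(y_{i,t-1} - y_{i,t}) & \text{if } i \in M, \end{cases}$$


where R is a ratio that adjusts for the fact that five-sixths of 
the sample between months is common. The value 
$R = \sum_{i \in M \cup U} w_i \big/ \sum_{i \in M} w_i$ is used in the production system. For 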
convenience, we used R = 6/5 during development since, 
in practice, the difference between the two is small because 
procedures to balance the weights by rotation group are 
used (e.g., nonresponse adjustment is done separately by 
rotation group). As before, the corresponding control total 
is last month’s estimate of the number of people employed 
in agriculture. Applying the final weights to the elements of 
this column of the X matrix and summing produces the 
equality 
$$\hat{Y}'_{t-1} = \hat{Y}'_t - \hat{\Delta}^{f}_{t-1,t}$$

or, in words, last month’s estimate equals this month’s 
estimate minus an estimate $\hat{\Delta}^{f}_{t-1,t}$ of $Y_t - Y_{t-1}$ based on the 
common sample. We use the superscript f in $\hat{\Delta}^{f}_{t-1,t}$ as a 
reminder that the estimate of change is based on the final 
weights following composite estimation. In terms of the 
“pre-composite” weights, it is easy to show in the univariate 
case that 


$$\hat{Y}'_t = (1-b)\,\hat{Y}_t + b\left(\hat{Y}'_{t-1} + \hat{\Delta}_{t-1,t}\right),$$


where b is the regression coefficient and $\hat{\Delta}_{t-1,t}$ is the estimate 
of change based on the original weights. The more general 
case where auxiliary variables are present is given by Fuller 
and Rao (2001, equation 2.3). 
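
The univariate identity can be checked in two lines. In this sketch $w_i$ are the pre-composite weights, $\hat{Y}_t$ is the corresponding estimate for the current month, $\hat{X}_t$ is the pre-composite weighted total of the change-driven column (a symbol introduced here for illustration), and $\hat{\Delta}_{t-1,t} = R \sum_{i \in M} w_i (y_{i,t} - y_{i,t-1})$:

```latex
\begin{align*}
\hat{X}_t &= \sum_{i} w_i\, x_i^{(C)} = \hat{Y}_t - \hat{\Delta}_{t-1,t}, \\
\hat{Y}'_t &= \hat{Y}_t + b\left(\hat{Y}'_{t-1} - \hat{X}_t\right)
            = (1-b)\,\hat{Y}_t + b\left(\hat{Y}'_{t-1} + \hat{\Delta}_{t-1,t}\right).
\end{align*}
```

The second line is the ordinary regression estimator with control total $\hat{Y}'_{t-1}$. The result has the same form as the K-composite estimator, with the regression coefficient b playing the role of K.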

Earlier results have shown that using the L controls 
produces better estimates of level for the variables added to 
the X matrix as controls. Similarly, adding C controls 
produces good estimates of change for the variables that are 
added. Singh, et al. (1997, 2001) present efficiency gains 
for C-based estimates of level and change and refer to 
earlier results on L-based estimates. 

Early in the development, an estimation system that used 
only the C-based controls was considered. However, there 
was some concern expressed about an estimation system 
based solely on change-driven controls since estimates of 
level are also very important (for example, they play a key 
role in the federal government’s Employment Insurance 
program). These concerns are summarized in Fuller and 
Rao (2001). 

In principle, we can add both L and C controls to the 
regression, but this would result in a large number of 
columns in the X matrix, which has undesirable conse- 
quences such as an increased number of extreme final 
weights, including negative weights. To avoid this, we 
would have to limit the number of industries included in the 
estimator. Wayne Fuller (see Fuller and Rao 2001) pro- 
posed an alternative which allows us to include the 
industries of greatest interest while allowing us to compro- 
mise between improving estimates of level and improving 
estimates of change. Fuller’s alternative is to take a linear 
combination of the L column and the C column for an 
industry and use it as the new column in the X matrix, i.e., 
use 

$$x_i = (1-\alpha)\,x_i^{(L)} + \alpha\,x_i^{(C)}.$$

The original level- and change-driven variables are special 
cases of Fuller’s compromise ($\alpha = 0$ and $\alpha = 1$, respectively). 
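
The construction of the new columns, and the way regression weighting forces their weighted sum to equal last month's composite estimate, can be sketched as follows. This is an illustration on toy data at the person level, not the LFS production system: the function names are hypothetical, the weighting step assumes the standard linear (generalised regression) calibration form, and household-level aggregation is ignored.

```python
import numpy as np


def composite_columns(y_prev, y_curr, matched, ybar_prev, R=6 / 5, alpha=2 / 3):
    """Level-driven, change-driven and combined columns for one variable.

    y_prev, y_curr : 0/1 indicators for last month and this month
                     (y_prev is ignored for unmatched persons)
    matched        : True for the common sample M, False for the birth sample U
    ybar_prev      : last month's composite estimate of the proportion
    """
    y_prev = np.asarray(y_prev, dtype=float)
    y_curr = np.asarray(y_curr, dtype=float)
    matched = np.asarray(matched, dtype=bool)
    # Level-driven: last month's value if matched, else last month's proportion
    x_level = np.where(matched, y_prev, ybar_prev)
    # Change-driven: current value, adjusted by R times the change if matched
    x_change = np.where(matched, y_curr + R * (y_prev - y_curr), y_curr)
    # Linear combination of the two columns
    x_mix = (1 - alpha) * x_level + alpha * x_change
    return x_level, x_change, x_mix


def regression_weights(initial_weights, X, control_totals):
    """Linear calibration: w_i = d_i * (1 + x_i' lam), so that X'w = totals."""
    d = np.asarray(initial_weights, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(control_totals, dtype=float)
    gram = X.T @ (d[:, None] * X)
    lam = np.linalg.solve(gram, totals - X.T @ d)
    return d * (1 + X @ lam)


# Toy sample of six persons; the last one is in the birth sample
y_prev = [1, 0, 1, 0, 0, 0]
y_curr = [1, 1, 0, 0, 0, 1]
matched = [True, True, True, True, True, False]
x_level, x_change, x_mix = composite_columns(y_prev, y_curr, matched, ybar_prev=0.3)

# One population control (column of ones) plus the composite control column
X = np.column_stack([np.ones(6), x_mix])
totals = [66.0, 20.0]  # made-up population total and last month's composite estimate
w = regression_weights([10.0] * 6, X, totals)
```

A single set of final weights `w` satisfies both the population control and the composite control, which is why parts continue to add up to totals under this approach.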


Choice of α: Fuller and Rao (2001) showed that, based on 
some reasonable assumptions, values of α such as 0.65 and 
0.75 produce reasonable estimates of both level and change. 
The actual choice of α depends on the variable of interest 
(specifically, its correlation over time) and on the relative 
importance of level versus change. 

Our studies (see Appendix 1) showed that for the two 
most important variables, employed and unemployed, the 
best choices of α for estimates of level are 0.39 and 0.24, 
respectively. The corresponding values for estimates of 
change are 0.99 and 0.81, respectively. Clearly, there is a 
need to compromise between the goals of improving 
estimates of level and estimates of change. 

To decide which values of α to study, we obtained 
compromise values of α by averaging the level-driven and 
change-driven values for each variable, i.e., we obtained 
approximately 0.7 and 0.52 for employed and unemployed, 
respectively. Results based on the values α = 1 and α = 0.75 
had already been produced, so we added results for α = 0.67 
and α = 0.6. Based on the results discussed below, which 
show that there are no substantial differences in the results 
for the three values 0.6, 0.67, and 0.75, we chose to 
implement the value α = 2/3 in the production system. 
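
The averaging step behind the compromise values can be written out as a quick check (the subscripts are labels added here for illustration):

```latex
\alpha_{\text{employed}} = \tfrac{1}{2}\,(0.39 + 0.99) = 0.69 \approx 0.7,
\qquad
\alpha_{\text{unemployed}} = \tfrac{1}{2}\,(0.24 + 0.81) = 0.525 \approx 0.52
```

The implemented value of 2/3 lies between these two averages and within the range 0.6 to 0.75 over which the results showed no substantial differences.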


3. FEATURES, PROPERTIES AND RESULTS 


We present a summary of some of the features and 
properties of the regression composite estimator. Some 
graphical and numerical results are presented in section 3.1 
below. 

Systems implementation. An important advantage of 
the estimator is that it can be implemented within the old 
LFS estimation system in a straightforward manner since, 
essentially, one needs to augment the regression matrix, as 
described above. This was an important factor in our initia- 
tive to study and finally introduce composite estimation as 
otherwise it would have cost a great deal more to change 
the system. 

Weighting. Unlike the AK estimator, where weighting 
to satisfy population control totals and composite estimation 
are separate steps, weighting for the regression composite 
estimator is done in one step, i.e., simultaneously with 
weighting to satisfy the age-sex and geographical controls. 
For illustration, the way the X matrix would be 
augmented when elements $x_i^{(L)}$ defined in section 2 are 
added is shown in Appendix 3. Adding the elements 
$x_i = (1-\alpha)\,x_i^{(L)} + \alpha\,x_i^{(C)}$ is similar. This not only preserves 
the consistency mentioned next but also retains the benefits 
of the controls applied to the usual regression estimator, i.e., 
the age-sex and geographic controls in our case. 

Consistency. Because weighting for age-sex and 
geographical controls is done at the same time as weighting 
for the composite estimate controls, consistencies are 
preserved. In particular, parts add up to totals; e.g., 
Employed + Unemployed = Labour Force. In other 
approaches to composite estimation, consistency is 
achieved by other means which require either a separate 
step or a compromise of some kind. 

Efficiency gains. For the variables that are added as 
control totals, there are substantial gains in efficiency for 
both estimates of level and of change. For α = 1, the gains 
for estimates of change can be dramatic; by choosing a 
smaller value of α we gain more for estimates of level while 
reducing the magnitude of the gains for estimates of 
change. Some results for the case α = 2/3 are given in 
section 3.1. 

Seasonal adjustment. The time series of employment by 
various industries are scrutinized by both internal and 
external users of the Labour Force Survey. One important 
consequence of the gain in efficiency is that several of these 
series which could not be seasonally adjusted in the past 
can now be seasonally adjusted. In other words, composite 
estimation increases the signal-to-noise ratio sufficiently 
that seasonal adjustment becomes effective. A related 
consequence of composite estimation that is popular with 
users is that several estimates that were published as three- 
month moving averages are now published as monthly 
estimates. 

Systematic differences between composite and usual 
level estimates. In theory, the expectations, taken over all 
possible samples, for both the usual and composite estima- 
tors should be the same, making them both unbiased or 
almost unbiased. One would therefore expect that the 
estimates of level obtained using the two estimators would 
criss-cross each other over time. In practice, however, this 
does not happen. This is due to the fact that, when actual 
survey conditions are taken into account, the composite 
estimator and the usual estimator do not have the same 
expected value; for example, see Bailar (1975) and Kumar 
and Lee (1983) for results on the K- and AK-composite 
estimators, respectively. Kumar and Lee show this by 
deriving explicit expressions for the expected value of the 
usual estimator and the AK-composite estimator. The 
matched and unmatched samples differ because of differ- 
ences in nonresponse rates and the mode of data collection 
(e.g., personal versus telephone interviewing, centralized 
versus decentralized interviewing). In practice, the units in 
the “birth” sample have a higher nonresponse rate, and the 
missing households tend to be smaller and have higher 
employment rates than the responding ones. Since the usual 
estimator and the composite estimator give different 
weights to the matched and unmatched sample, they will 
have different expected values. Thus time series for the two 
estimators can display systematic differences. In practice, 
these differences are usually swamped by sampling varia- 
tion, but they become evident for more precise series such 
as Employed for big provinces like Ontario and for Canada. 
Our results for Employed are consistent with those 
described by Bailar (1975) for the U.S. Current Population 
Survey, i.e., the composite estimates for Employed tend to 
be smaller than the usual estimates. For Unemployed in 
Ontario, the difference between the two types of estimates 
tends to be much smaller. 

Ways of reducing systematic differences between esti- 
mates from different rotation groups are currently being 
investigated. In particular, the possibility of introducing a 
weight adjustment for the number of households of 
different sizes by rotation group is being studied as a way 
of adjusting for the fact that small households are under- 
represented in the birth rotation. This would benefit both 
the composite estimators and the usual regression estimator, 
and would probably reduce the gap between them. 


3.1 Empirical Results 


Employment and unemployment at the provincial 
level. Graph 1 shows total employment at the province level 
from 1987 to 1998 for Ontario. The composite estimation 
series for the four values of α, i.e., 0.6, 0.67, 0.75 and 1, 
behave similarly. In these graphs, it is 
clear that there is a change in level for this series under 
composite estimation: the estimated number of employed 
persons is lower. The seasonally adjusted versions of the 
Ontario employment series based on the usual estimator and 
on the composite estimator for α = 0.67 are shown in Graph 
2. 

Graph 3 compares the usual estimates of Ontario 
unemployed to the regression composite estimate for 
α = 0.67. The effect of composite estimation on this 
variable is clearly less pronounced than on employment- 
related variables. 

Graph 4 compares year-to-year changes in Ontario 
employment for the two estimators. Each point in the series 
is the difference between employment in year y, month m 
and year y - 1, month m. For example, the first point is 
January 1988 employment minus January 1987 employ- 
ment. The composite estimation series is clearly smoother, 
especially in the second half of the twelve-year period. 

Graph 1: Ontario Employed (000's) Unadjusted 

Graph 2: Ontario Employed (000's) Seasonally Adjusted 

Graph 3: Ontario Unemployed (000's) Unadjusted 

Employment by subprovincial region. Graph 5 compares 
the usual estimate of employment with the composite 
estimate with α = 0.67 for an economic region in Ontario. 
The results for other subprovincial regions are similar. The 
behaviour of the usual and composite estimate series is 
very similar; thus, the effect of composite estimation is 
neither beneficial nor harmful. For special tabulations, the 
LFS estimation system has the flexibility to allow the user 
to add controls at the economic region level if needed. 
There is already a control for the total population in each 
economic region. 

Graph 5: Employed in Economic Region 510 

Employment by industry, and seasonal adjustment. 
The composite estimates were compared to the usual 
regression estimate for sixteen industries. Graphs 6A-6D 
show the results for two of them in Ontario. Though not 
included in these graphs, once again, the four values of α 
result in composite estimation series that generally behave 
similarly, although sometimes the series for α = 1 departs 
from the others. The composite estimation series tend to be 
less volatile than the regression series. This is particularly 
noticeable for the seasonally adjusted Trade series which 
we have included here because it illustrates the most 
extreme case. For this series, the behaviour of the original 
regression estimates in the first few years, in both the 
seasonally adjusted and unadjusted series, is difficult to 
explain from a subject-matter viewpoint. The behaviour of 
the Manufacturing series is more typical of the remaining 
fourteen industries. 

Comparing the seasonally adjusted (Graph 6D) and 
unadjusted (Graph 6C) series for Trade, we see that 
seasonal adjustment has had relatively little effect on the 
regression series, but has changed the composite series 
significantly. This is a manifestation of the ability of 
composite estimation to increase the signal-to-noise ratio 
sufficiently to make seasonal adjustment effective. 

The seasonal adjustment program used by the Labour 
Force Survey computes a variety of measures that are used 
as indicators of the effectiveness of seasonal adjustment. 
Some of these measures are presented in Appendix 2. These 
show that, for Ontario employment in the twelve-year 
period 1987-1998, composite estimation increases the 
number of industries that can be successfully seasonally 
adjusted. Results for other provinces and for Canada as a 
whole are similar. 

A measure of stability. For several important data 
series, instead of monthly estimates, three-month moving 
averages were published in the past. This was due to the 
high sampling variability associated with these series, 
leading to unacceptable volatility in the monthly series. Of 
particular interest are province-level estimates by industry 
and by class of worker. It had been anticipated that the 
composite estimates for these series would demonstrate 
more stability, allowing the publication of monthly esti- 
mates instead of three-month averages. A measure of 
stability, the index of volatility, is computed as follows. For 
each industry, the month-to-month change in employment 
is calculated from seasonally adjusted estimates. Then the 
difference between consecutive change estimates is 
computed. The absolute value of this “change in the 
change” is expressed as a percentage of the corresponding 
monthly total estimate. These percentages are then averaged 
over the entire year. Large values of this measure occur 
when a series has many consecutive movements in opposite 
directions, indicating volatility. 
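
As an illustration only, the index of volatility described above might be computed as in the following Python sketch. The function name `volatility_index` and the two toy series are hypothetical, and two details are assumptions: the "corresponding monthly total" is taken to be the level in the latest of the three months involved, and the average is taken over whatever window of months is passed in (a year in the text).

```python
def volatility_index(series):
    """Average |change in the change| as a percentage of the monthly level.

    `series` holds consecutive seasonally adjusted monthly estimates.
    """
    # Month-to-month changes: c[t] = y[t] - y[t-1].
    changes = [b - a for a, b in zip(series, series[1:])]
    # Differences between consecutive changes ("change in the change").
    second_diffs = [b - a for a, b in zip(changes, changes[1:])]
    # Assumption: each second difference is expressed relative to the level
    # in the latest month it involves, i.e. series[2], series[3], ...
    percentages = [100.0 * abs(d) / level
                   for d, level in zip(second_diffs, series[2:])]
    return sum(percentages) / len(percentages)


# Hypothetical series: a steady trend and a series that reverses every month.
steady = [100.0, 101.0, 102.0, 103.0, 104.0]
zigzag = [100.0, 102.0, 100.0, 102.0, 100.0]
print(volatility_index(steady))
print(volatility_index(zigzag))
```

The steady series gives an index of zero, while the series that alternates direction every month gives a large value, which is the behaviour the measure is meant to flag.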

The volatility index was computed for sixteen industries. 
Graphs 7A and 7B for two of these industries, Ontario 
Manufacturing and Trade, are included here, comparing the 
usual estimator, the three-month moving average of the 
usual estimator and the monthly composite estimator with 
α = 0.67. For Manufacturing, the average indices for the 
usual, composite and moving average estimates are 2.4, 1.8 
and 0.60, respectively. For Trade, the corresponding values 
are 2.4, 1.9 and 0.55. For all industries, the volatility of the 
composite estimates typically falls between that of the usual 
monthly estimates and that of the three-month moving 
averages. Occasionally, for isolated years, the composite 
estimates are less volatile than the three-month averages or 
more volatile than the usual monthly estimates. We 
also note that when the usual monthly estimates exhibit 
extreme volatility, the composite series tend to be more 
stable. The monthly regression estimates compete with the 
composite estimates only when the volatility index is low 
for both of them. 

Graph 6. Selected Employment by Industry 
6C: Ontario Employed in Trade (000's) Unadjusted 

Graph 7. Index of Volatility 
7A: Ontario Employed in Manufacturing 
7B: Ontario Employed in Trade 

With the introduction of composite estimation, 
three-month moving averages were dropped in favour of the 
more desirable monthly estimates for industry series. 

Variance estimates. For variables that are added as 
control totals, such as employment by industry, there can be 
substantial gains in efficiency at the province level, where 
efficiency is defined as Var(greg)/Var(composite). For most 
industries, gains of 10 to 20 percent are typical, but they can 
be as high as 40 percent. A 40 percent efficiency gain 
corresponds, for example, to reducing a 15 percent coeffi- 
cient of variation to 12.7 percent and a 10 percent coeffi- 
cient of variation to 8.5 percent. For province-level employ- 
ment and unemployment estimates, the efficiency gains are 
more modest, typically in the five to ten percent range. For 
estimates of month-to-month change, the efficiency gains 
for controlled variables are bigger, usually more than 
double the gains for estimates of level. 
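
The arithmetic linking an efficiency gain to a coefficient of variation can be checked with a short Python sketch. It assumes that a "40 percent gain" means Var(greg)/Var(composite) = 1.4, so that the coefficient of variation shrinks by the factor 1/sqrt(1.4); the function name `cv_after_gain` is hypothetical.

```python
import math


def cv_after_gain(cv_percent, gain_percent):
    """CV of the composite estimate given the CV of the usual estimate.

    Assumes efficiency = Var(greg) / Var(composite) = 1 + gain_percent / 100
    and that both estimators target the same level.
    """
    efficiency = 1.0 + gain_percent / 100.0
    # The variance is divided by `efficiency`, so the CV is divided by its root.
    return cv_percent / math.sqrt(efficiency)


print(round(cv_after_gain(15.0, 40.0), 1))
print(round(cv_after_gain(10.0, 40.0), 1))
```

Under that reading, the two calls reproduce the 12.7 percent and 8.5 percent figures quoted above.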

For variables that are not controlled, there is little or no 
effect of composite estimation on efficiency unless the 
variable is highly correlated with a controlled variable. For 
example, at the province level, Employed Males shows a 
gain in efficiency because it is correlated with total 
employed, which is controlled. On the other hand, employ- 
ment by subprovincial economic region shows neither gains 
nor losses. 


4. TREATMENT OF MISSING DATA 


By definition, the $x_i$ variables involve data from the 
current and previous month. This leads to complications 
when, for a given person in the common sample, data is 
available only for one month. This may occur due to nonresponse 
in either month or when a move or change in 
scope has taken place between the two months. The 
different cases that may occur are represented in the following 
diagram, where R denotes a response, X denotes a 
nonresponse and O denotes a unit that is out of scope. 


Case        |   A   |   B   |   -   |   C   |   D 
Month t     | XXX.. | RRR.. | RRR.. | RRR.. | OOO.. 
Month t - 1 | RRR.. | XXX.. | RRR.. | OOO.. | RRR.. 

In all these cases, namely A, B, C, and D, the objective 
is to find a solution such that $\sum_i w_{it} x_{it}$ is still an estimator 
of $Y_{t-1}$. We set the following two objectives for handling 
the situation of missing data from either month of the 
common sample: 


i) retain as many valid responses as possible, i.e., the 
option of removing a unit from the estimation process 
is rejected 


ii) develop an imputation method that does not 
understate the estimate of change in any significant 
way. 

In the case of nonresponse, there are two situations: Case 
A, where a household responded last month but not this 
month, and Case B which is the reverse situation. In the 
following, i denotes a person in an affected household. 


Case A: Replace $y_{it}$ by $\hat y_{it}$. This can be achieved in a 
number of ways. A simple approach is to replace $y_{it}$ by the 
corresponding response from the previous month, i.e., 
$y_{i,t-1}$. During the early stages of the study, this approach 
was used but rejected later as it can bias (understate) the 
estimate of change significantly. For the LFS estimation 
system, it was decided to use the previous month's known 
demographic and employment characteristics of persons to 
form imputation classes and then use hot deck imputation 
(i.e., current month's data) to obtain $\hat y_{it}$. An alternative 
would be to use a mean of some sort. 


Case B: The procedure is analogous, i.e., when last 
month's value is missing, then imputation classes are 
formed using data from month t and the donor is found 
using data from responding units in month t - 1. 
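
The hot deck step for the nonresponse cases can be sketched as follows. This is a minimal illustration under stated assumptions, not the LFS production procedure: the record fields (`age_group`, `status_prev`, `status_now`), the class definition and the function names are all hypothetical, and a recipient whose class has no donor is simply left missing.

```python
import random


def hot_deck(recipients, donors, class_key, value_key, rng):
    """Fill `value_key` for each recipient from a random donor in its class."""
    # Group donor values by imputation class.
    pools = {}
    for donor in donors:
        pools.setdefault(class_key(donor), []).append(donor[value_key])
    imputed = []
    for rec in recipients:
        filled = dict(rec)
        pool = pools.get(class_key(rec))
        # Leave the value missing when the class has no donor.
        filled[value_key] = rng.choice(pool) if pool else None
        imputed.append(filled)
    return imputed


def prev_month_class(person):
    # Case A: classes are formed from last month's known characteristics.
    return (person["age_group"], person["status_prev"])


# Hypothetical records; E = employed, U = unemployed.
donors = [
    {"age_group": "25-54", "status_prev": "E", "status_now": "E"},
    {"age_group": "25-54", "status_prev": "E", "status_now": "U"},
    {"age_group": "15-24", "status_prev": "U", "status_now": "U"},
]
recipients = [
    {"age_group": "15-24", "status_prev": "U", "status_now": None},
    {"age_group": "25-54", "status_prev": "E", "status_now": None},
]
imputed = hot_deck(recipients, donors, prev_month_class, "status_now",
                   random.Random(0))
print(imputed)
```

For Case B the same function would apply with the roles of the two months exchanged: classes built from month t characteristics and donor values taken from month t - 1 responses.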

In the case where unit i has moved or changed scope, the 
following situations may arise. 


Case C: Suppose that unit i was out of scope at time t - 1 
but is in scope at time t (e.g., a person who just turned 15, 
or a newly arrived immigrant). Then unit i should 
contribute 0 at time t - 1 and $y_{it}$ at time t. Hence we let 
$x_{it} = 0$ since $\sum_i w_{it} x_{it}$ should estimate $Y_{t-1}$. 


Case D: Conversely, suppose that unit i was in scope and 
is now out of scope. This includes, e.g., people who left the 
country, joined the military or died. Such units should be 
dropped since the target population is the in-scope 
population at time t (and the ultimate goal is to estimate $Y_t$). 

Since we sample dwellings but collect data for individuals 
within those dwellings, two other situations arise due to 
movement of persons in and out of the sampled dwellings. 


Case i): Suppose that unit i was in the population at both 
times but in a sampled dwelling only at time t (i.e., a person 
who moved from a non-sampled dwelling to a sampled 
dwelling). Then his/her status at time t - 1 is unknown, i.e., 
$y_{i,t-1}$ is unknown. For all such cases, as in the nonresponse 
case, we can impute a value $\hat y_{i,t-1}$ for $y_{i,t-1}$ either from a 
donor in the sample or by a sample mean. The LFS uses hot 
deck imputation. 


Case ii): Finally, consider the case where unit i was in the 
sample at time t - 1 but moved to a non-sampled dwelling 
at time t. Since the LFS sample is a sample of dwellings and 
not a sample of people, this unit should simply be dropped 
when computing estimates of $Y_t$. 


5. CONCLUSION 


The composite estimator described in this document 
meets all the objectives that were set at the beginning of this 
project and summarized in the introduction. It produces 
estimates of level and change that are more efficient than 
the estimates produced by the usual regression estimator 
while satisfying all operational and consistency constraints. 
The impact of the composite estimator with the value 
α = 2/3 on the many time series produced by the Labour 
Force Survey is generally moderate. When the impact is 
substantial, as in the Ontario Trade series, for example, the 
new series tend to make more sense from a subject-matter 
expert’s perspective. This type of improvement in the series 
makes it easier to explain LFS estimates to users and the 
media. 

The composite estimates have other features that users 
find very desirable. Because of the reduced variance under 
composite estimation, it is possible to publish monthly 
estimates in many cases where only three-month moving 
averages were published in the past. In addition, a greater 
number of series can be successfully seasonally adjusted. 

Implementation of composite estimation for the LFS is 
an important first step. Studies to improve the treatment of 
nonsampling errors are ongoing, and their results can be 
incorporated into the weighting and estimation system at 
any time. The system has the great advantage that it is very 
flexible. For example, the value of α can be changed easily; 
hence a comparison of a broad range of α values for a 
number of important variables is planned. This may lead to 
a system in which different α values are used for different 
control variables, while still having a unique final weight 
per record. 

APPENDIX 1 


Relationship between α, ρ and (A, K). Kumar and Lee 
(1983) found optimal values of A and K in AK-composite 
estimation for estimates of level and change as a function of 
the correlation coefficient ρ. We derived an approximate 
relationship between the A and K values, ρ and α. This 
result was then used to find good values of α for several 
variables. These are presented in Tables 1 and 2 for 
estimates of level and change, respectively. The A and K 
values in the tables are the optimal ones for the corresponding 
value of ρ. The values of α in the tables are 
consistent with those obtained by Wayne Fuller based on an 
AR(1) model (personal communication). The value of α for 
Labour Force in Table 2 exceeds one because of the 
approximation. 


Table 1 
α Values for Several Variables - Level 


Variable            ρ       A      K     α 

Employed            0.852   0.49   0.8   0.385 
Unemployed          0.58    0.38   0.5   0.242 
Labour Force        0.843   0.48   0.8   0.403 
E.P. Agriculture    0.955   0.38   0.8   0.448 


Table 2 
α Values for Several Variables - Change 


Variable            ρ       A      K     α 

Employed            0.852   0.1    0.9   0.995 
Unemployed          0.58    0.2    0.6   0.806 
Labour Force        0.843   0.1    0.9   1.009 
E.P. Agriculture    0.955   0      0.9   0.959 

APPENDIX 2: 
Seasonal adjustment measures for Ontario employment by industry 


[Table of F-value, M7 and SMOOTH by industry: values not recoverable] 


Description of Measures 


F-value: F-value for the test performed within the X11-ARIMA program to check for the presence of stable seasonality. The 
higher the F, the more significant is the presence of stable seasonality. 


M7: Measure that combines the test for stable and moving seasonality. Generally, when M7 is greater than 1, there is no 
identifiable seasonality present in the series; therefore, the series should not be adjusted. 


SMOOTH: Percentage difference between the standard deviation of the month-to-month changes in the original series and 
the standard deviation of the month-to-month changes in the seasonally adjusted series. The larger this value, the more 
smoothing was obtained through the seasonal adjustment process. 

APPENDIX 3: 
Implementing Regression Composite Estimation within the LFS Estimation Framework: 
Illustrated Using the Change-driven Approach 


Original X matrix 


Age-sex indicators 


Region indicators 


Population control 


Modified X matrix for composite estimation when the $x_i$ are added 


E' is last month's employment estimate. 


For birth units, set a, b, c, . . . to indicate this month's status 
(e.g., a = 1 if employed, 0 otherwise). For matched units, do 
the following: 


a = $e_t + (e_{t-1} - e_t) \times 6/5$, where e = 1 if person is employed, e = 0 
otherwise 

d = $ag_t + (ag_{t-1} - ag_t) \times 6/5$, where ag = 1 if person is employed 
in agriculture, ag = 0 otherwise 


Examples: 


(i) Suppose Person 2 was employed in agriculture both 
last month and this month. Then $e_{t-1} = e_t = 1$ and 
$ag_{t-1} = ag_t = 1$, hence c = 1 - 0 = 1 and d = 1 - 0 = 1. 


(ii) Suppose Person 2 was employed in agriculture last 
month and in mining this month. Then $e_{t-1} = e_t = 1$, 
$ag_{t-1} = 1$ and $ag_t = 0$, hence c = 1 - 0 = 1 and d = 0 + 
(1 - 0) × 6/5 = 1.2. 


(iii) Suppose Person 2 was employed in mining last 
month and in agriculture this month. Then $e_{t-1} = e_t = 1$, 
$ag_{t-1} = 0$ and $ag_t = 1$, hence c = 1 - 0 = 1 and d = 1 + 
(0 - 1) × 6/5 = -0.2. 
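
The entries for matched and birth units can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the three examples: the function name `control_entry` and its arguments are hypothetical, and the factor 6/5 is taken from the formulas above rather than derived here.

```python
def control_entry(prev, curr, matched=True, ratio=6 / 5):
    """Entry of a composite-control column for one person.

    prev, curr: 0/1 indicators for last month and this month.
    Birth (unmatched) units simply carry this month's indicator.
    """
    if not matched:
        return float(curr)
    # Matched units: this month's indicator plus the scaled difference
    # between last month's and this month's indicators.
    return curr + (prev - curr) * ratio


# Agriculture column d for the three examples.
print(control_entry(1, 1))   # example (i): agriculture in both months
print(control_entry(1, 0))   # example (ii): agriculture last month only
print(control_entry(0, 1))   # example (iii): agriculture this month only
```

Up to floating-point rounding, the three calls give 1, 1.2 and -0.2, matching the values of d in examples (i) to (iii).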


Variance Estimation After Imputation 


JAE-KWANG KIM¹ 


ABSTRACT 


Imputation is commonly used to compensate for item nonresponse. Variance estimation after imputation has generated 
considerable discussion and several variance estimators have been proposed. We propose a variance estimator based on a 
pseudo data set used only for variance estimation. Standard complete data variance estimators applied to the pseudo data 
set lead to consistent estimators for linear estimators under various imputation methods, including without-replacement hot 
deck imputation and with-replacement hot deck imputation. The asymptotic equivalence of the proposed method and the 
adjusted jackknife method of Rao and Sitter (1995) is illustrated. The proposed method is directly applicable to variance 
estimation for two-phase sampling. 


KEY WORDS: Two-phase sampling; Item nonresponse; Deterministic imputation; Random imputation. 


1. INTRODUCTION 


Imputation, inserting values for missing items, is com- 
monly used for handling missing survey data. An advantage 
of imputation is its convenience. That is, we can apply 
standard complete data programs for computing point 
estimates to the imputed data set. Rubin (1996), Fay (1996), 
and Rao (1996) reviewed various issues on imputation. 

All imputation methods use some type of model. After 
designating a model, we can use either deterministic impu- 
tation or random imputation based on the model. Under 
random imputation, missing values are imputed by the use 
of some form of probability sampling. We call this addi- 
tional random mechanism the imputation mechanism. On 
the other hand, deterministic imputation does not introduce 
an additional random mechanism. When the set of respon- 
dents is viewed as a random sample from the original 
sample, the selection mechanism of the respondents is 
called the response mechanism. The response mechanism is 
often regarded as the second phase of sampling. See 
Särndal and Swensson (1987) for details. 

With a suitable imputation model and method, the bias 
due to nonresponse can be greatly reduced relative to using 
only the observed data. However, it is well known that a 
variance estimator which uses the imputed data as if it were 
observed data is inconsistent. 

Various methods have been proposed for variance esti- 
mation after imputation. Rubin and Schenker (1986) and 
Rubin (1987) advocate multiple imputation. Multiple impu- 
tation creates multiple data sets and calculates the complete 
data statistics for each imputed data set. The variance esti- 
mator is calculated by combining two terms, the within- 
dataset variance term and the between-dataset variance 
term. Multiple imputation applies standard variance estimators 
to each data set to compute within-dataset variance 
terms and applies the standard point estimators to compute 
a between-imputed-dataset variance term. This method 
requires the imputation method to be proper. That is, the 
imputation should satisfy conditions 1-3 in Rubin (1987, 
pages 118-119). These conditions are not always easy to 
achieve. (For example, see Fay 1992). Even the multiple 
imputation methods described in Schafer (1997) are not 
shown to be proper in the sense of Rubin. As noted by Rao 
(1996), some commonly used imputation methods, 
including hot deck imputation and regression imputation, 
are not proper. 

Rao and Shao (1992) and Rao and Sitter (1995) 
proposed an adjusted jackknife variance estimator. The 
suggested procedure is applicable to a number of impu- 
tation methods and sample designs. The actual calculation 
using standard complete data software is not easy because 
special computations are performed to adjust the imputed 
values for each pseudo replicate. Also, Särndal (1992) 
proposed a variance estimation method that explicitly uses 
the model considered for imputation. 

Essentially, Rubin’s method generates several pseudo 
data sets for variance estimation and applies the standard 
variance estimators to each data set to compute the within- 
dataset variance terms, while Rao's method and Särndal's 
method apply a special variance estimator to the imputed 
data set. In this paper, a method to create a single pseudo 
data set for variance estimation is proposed. In section 2, 
the new method is introduced in a two-phase sampling set- 
up. In section 3, we illustrate extensions of the suggested 
method to the random imputation method. In section 4, we 
extend the suggested method to complex sampling designs. 
In section 5, comparisons are made with the adjusted jack- 
knife variance estimator. In section 6, a limited simulation 
study is presented. Some concluding remarks are made in 
section 7. Outlines of some proofs are given in the 
appendix. 


¹ Jae-Kwang Kim, Westat, 1650 Research Boulevard, Rockville, Maryland, 20850, U.S.A. 


2. A VARIANCE ESTIMATION METHOD 


We outline a variance estimation procedure applicable 
for two-phase samples and for imputed samples. The proce- 
dure requires a separate data set for variance estimation in 
addition to the tabulation data set. To introduce the proce- 
dure and to illustrate the concepts, consider a two-phase 
sample. Let the second phase be a simple random sample of 
size r selected from the first phase, which is a simple ran- 
dom sample of size n selected from an infinite population. 
Let the regression estimator of the mean of a characteristic 
y be 

$$\hat\mu_I = \bar y_2 + (\bar x_1 - \bar x_2)\hat\beta, \qquad (1)$$

where 

$$(\bar x_2, \bar y_2) = r^{-1}\sum_{i=1}^{r}(x_i, y_i), \qquad \bar x_1 = n^{-1}\sum_{i=1}^{n} x_i,$$

$$\hat\beta = \left[\sum_{i=1}^{r}(x_i - \bar x_2)^2\right]^{-1}\sum_{i=1}^{r}(x_i - \bar x_2)(y_i - \bar y_2),$$

and the second phase units are indexed from one to r. It is 
well known (e.g., Cochran 1977, equation 12.51) that the 
variance of the regression estimator is, approximately, 

$$V\{\hat\mu_I\} = [n^{-1}\rho^2 + r^{-1}(1-\rho^2)]\sigma_y^2, \qquad (2)$$

where ρ is the population correlation between y and x and $\sigma_y^2$ 
is the population variance of y. An estimator of the variance 
is, by classical regression theory, 

$$\hat V\{\hat\mu_I\} = n^{-1}(n-1)^{-1}\sum_{i=1}^{n}(\hat y_i - \bar y_I)^2 + r^{-1}(r-2)^{-1}\sum_{i=1}^{r}(y_i - \hat y_i)^2, \qquad (3)$$

where $\hat y_i = \bar y_2 + (x_i - \bar x_2)\hat\beta$ for i = 1, 2, ..., n and 
$\bar y_I = n^{-1}\sum_{i=1}^{n}\hat y_i$. Observe that $\bar y_I$ is an alternative way of 
writing $\hat\mu_I$ in (1). 

Let 

$$c_i = [n(n-1)r^{-1}(r-2)^{-1}]^{1/2} \qquad (4)$$

and 

$$y_i^* = \begin{cases} \hat y_i + c_i(y_i - \hat y_i), & i = 1, \ldots, r, \\ \hat y_i, & i = r+1, \ldots, n. \end{cases} \qquad (5)$$

Then, 

$$\hat V\{\hat\mu_I\} = n^{-1}(n-1)^{-1}\sum_{i=1}^{n}(y_i^* - \bar y_I)^2, \qquad (6)$$

where $\bar y_I$ is the mean of the $\hat y_i$, as well as the mean of the $y_i^*$, 
because the sum of $y_i^* - \hat y_i$ is zero. Equation (6) is the 
operational form of the suggested estimator. The variance 
estimation data set contains the pseudo observation $y_i^*$. 

To the extent that the model for imputation matches that 
of two-phase sampling, equation (6) is applicable to an 
imputed data set. For example, if we assume that missing 
data are missing at random and use regression to impute the 
missing value with $\hat y_i$, then equation (6) is immediately 
applicable. Of course, regression imputation or two-phase 
sampling can use a vector x. 
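
The agreement between the classical estimator (3) and the pseudo-data form (6) can be checked numerically. The following Python sketch assumes a scalar x with an intercept (so p = 2, as in (3) and (4)), takes the first r units as the second-phase sample, and uses made-up data; the function name `two_phase_variances` is hypothetical.

```python
import math


def two_phase_variances(x, y_obs):
    """Return (regression estimate, estimator (3), estimator (6)).

    x: values for all n first-phase units; y_obs: y for the first r units.
    """
    n, r = len(x), len(y_obs)
    xbar2 = sum(x[:r]) / r
    ybar2 = sum(y_obs) / r
    sxx = sum((xi - xbar2) ** 2 for xi in x[:r])
    sxy = sum((xi - xbar2) * (yi - ybar2) for xi, yi in zip(x, y_obs))
    beta = sxy / sxx
    # Fitted values for all n units; their mean is the regression estimator.
    yhat = [ybar2 + (xi - xbar2) * beta for xi in x]
    ybar_I = sum(yhat) / n
    resid = [yi - fi for yi, fi in zip(y_obs, yhat)]
    # Equation (3): explicit two-term estimator.
    v3 = (sum((f - ybar_I) ** 2 for f in yhat) / (n * (n - 1))
          + sum(e ** 2 for e in resid) / (r * (r - 2)))
    # Equations (4)-(5): pseudo observations for the variance data set.
    c = math.sqrt(n * (n - 1) / (r * (r - 2)))
    ystar = [f + c * e for f, e in zip(yhat, resid)] + yhat[r:]
    # Equation (6): complete-data formula applied to the pseudo data.
    mean_star = sum(ystar) / n
    v6 = sum((s - mean_star) ** 2 for s in ystar) / (n * (n - 1))
    return ybar_I, v3, v6


# Hypothetical data: n = 8 first-phase units, r = 5 respondents.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_obs = [1.1, 1.9, 3.2, 3.8, 5.1]
ybar_I, v3, v6 = two_phase_variances(x, y_obs)
print(ybar_I, v3, v6)
```

The two variance values coincide up to rounding because the least squares residuals are orthogonal to both the constant and the fitted values, so the cross term in the expansion of (6) vanishes.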


3. EXTENSIONS TO RANDOM IMPUTATION 


A moderate extension of the method described in section 
2 enables us to estimate the variance of a sample mean 
using random imputation. In fact, alternative approaches 
are possible. 

As one approach, assume that the imputation model is 
the regression model 

$$y_i = x_i\beta + e_i, \qquad (7)$$

where the first element of every $x_i$ is equal to 1 and the $e_i$ 
are uncorrelated $(0, \sigma_e^2)$ random variables. 
Assume the model is estimated and that the imputed 
values are 

$$\tilde y_i = \hat y_i + \hat e_i^*, \qquad i = r+1, r+2, \ldots, n, \qquad (8)$$

where $\hat y_i = x_i\hat\beta$ with $\hat\beta = (\sum_{i=1}^{r} x_i'x_i)^{-1}\sum_{i=1}^{r} x_i'y_i$ and $\hat e_i^*$ is 
chosen at random from the set $\hat e = \{\hat e_i = y_i - \hat y_i;\ i = 1, 2, \ldots, r\}$. 
The estimator of the mean of y is 

$$\hat\mu_I = n^{-1}\sum_{i=1}^{n}\tilde y_i, \qquad (9)$$

where $\tilde y_i = y_i$ for i = 1, 2, ..., r. 

If the $\hat e_i^*$ are chosen with replacement with equal 
probability from the set $\hat e$, then the variance of $\hat\mu_I$ is, 
approximately, 

$$V\{\hat\mu_I\} = [n^{-1}R^2 + (r^{-1} + n^{-2}m)(1-R^2)]\sigma_y^2, \qquad (10)$$

where m = n - r and $R^2$ is the squared multiple correlation 
coefficient between y and x. The increase in variance due 
to using random imputation with $\hat e_i^*$, rather than using 
$\hat e_i^* = 0$, is $n^{-2}m(1-R^2)\sigma_y^2$. 

Therefore, an estimator of the variance of the imputed 
sample mean is given by (6) where the $c_i$ of (4) is 

$$c_i = [n(n-1)(r^{-1} + n^{-2}m)(r-p)^{-1}]^{1/2}, \qquad (11)$$

and p is the dimension of β. We have 

$$\hat V\{\hat\mu_I\} = n^{-1}(n-1)^{-1}\sum_{i=1}^{n}(\hat y_i - \bar y_I)^2 + (r^{-1} + n^{-2}m)(r-p)^{-1}\sum_{i=1}^{r}(y_i - \hat y_i)^2, \qquad (12)$$

where $\bar y_I = n^{-1}\sum_{i=1}^{n}\hat y_i$. The estimator of the variance using $c_i$ 
of equation (11) is an estimator of the unconditional variance, 
the average over all possible imputed samples. 
Derivations of (10) and (12) are given in Appendix A. 

To consider an alternative variance estimation approach, 
we assume that a random selection procedure is used for 
imputation but place no restriction on the procedure, other 
than that the probabilities of selection are inversely proportional 
to the probability that the y-value responds. In addition, 
we record the number of times an $\hat e_i$ value is used as 
a donor in the imputation. 

Let 

$$y_i^* = \begin{cases} \hat y_i + c_i(y_i - \hat y_i), & i = 1, \ldots, r, \\ \hat y_i, & i = r+1, \ldots, n, \end{cases} \qquad (13)$$

with 

$$c_i = [n^{-1}(n-1)r(r-p)^{-1}]^{1/2}(1 + d_i), \qquad (14)$$

where $d_i$ is the number of times $\hat e_i$ is used as a donor. The 
term $[n^{-1}(n-1)r(r-p)^{-1}]^{1/2}$ is used to adjust for the effect 
of estimating p regression parameters. Then, the variance 
estimator (6) can be written as 

$$\hat V\{\hat\mu_I\} = n^{-1}(n-1)^{-1}\sum_{i=1}^{n}(\hat y_i - \bar y_I)^2 + n^{-2}r(r-p)^{-1}\sum_{i=1}^{r}(1 + d_i)^2(y_i - \hat y_i)^2. \qquad (15)$$

If the imputation method is simple random sampling with 
replacement, then, conditional on the sample and the 
respondents, 

$$E_I\{(1 + d_i)^2\} = \left(\frac{n}{r}\right)^2 + \frac{m}{r}\left(1 - \frac{1}{r}\right), \qquad (16)$$

where the notation I is used here to denote the expectation 
with respect to the imputation mechanism generated by 
random imputation. The equality in (16) establishes the 
equivalence of (12) to (15) under with-replacement 
selection. It is shown in Appendix B that $\hat V\{\hat\mu_I\}$ in (15) is 
also a valid estimator when donors are selected without 
replacement. Since the proposed variance estimation method 
is the conditional variance given the realized imputed 
sample, it has wide applicability. 
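
The expectation in (16) can be verified exactly for small cases. The Python sketch below assumes that, under with-replacement simple random selection, each of the m = n - r missing units independently picks one of the r respondents with equal probability, so that the donor count $d_i$ follows a Binomial(m, 1/r) distribution; the function names are hypothetical.

```python
from math import comb


def exact_expectation(n, r):
    """E{(1 + d)^2} for d ~ Binomial(m, 1/r), with m = n - r."""
    m = n - r
    p = 1.0 / r
    # Sum (1 + k)^2 against the binomial probability of k donations.
    return sum(comb(m, k) * p ** k * (1 - p) ** (m - k) * (1 + k) ** 2
               for k in range(m + 1))


def closed_form(n, r):
    """Right side of (16): (n/r)^2 + (m/r)(1 - 1/r)."""
    m = n - r
    return (n / r) ** 2 + (m / r) * (1 - 1 / r)


for n, r in [(10, 6), (50, 30), (7, 7)]:
    print(n, r, exact_expectation(n, r), closed_form(n, r))
```

Multiplying the closed form by $n^{-2}r(r-p)^{-1}$ gives $(r-p)^{-1}[r^{-1} + n^{-2}m(1 - r^{-1})]$, which differs from the coefficient in (12) only by a term of smaller order.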


4. COMPLEX SAMPLING DESIGNS 


4.1 Deterministic Imputation 


The suggested method is applicable to complex sampling 
designs as well as to simple random sampling. Assume that 
the full sample estimator of the mean of y can be written as 
$\bar y = \sum_{i=1}^{n} w_i y_i$, where $w_i$ is the sampling weight of unit i in 
the sample. Assume that $\sum_{i=1}^{n} w_i = 1$. 

If the first r elements are observed and the remaining 
n - r elements are missing, then the estimator of the mean 
of y under regression imputation is 

$$\bar y_I = \sum_{i=1}^{r} w_i y_i + \sum_{i=r+1}^{n} w_i\hat y_i, \qquad (17)$$

where 

$$\hat y_i = x_i\hat\beta, \qquad \hat\beta = \left(\sum_{i=1}^{r} w_i^* x_i'x_i\right)^{-1}\sum_{i=1}^{r} w_i^* x_i'y_i.$$

Here $w_i^*$ is the sampling weight of unit i in the second-phase 
sample and is defined by 

$w_i^*$ = [Pr(i is in the second phase sample | i is in the 
first phase sample)]$^{-1} w_i$. 

Also, $\sum_{i=1}^{r} w_i^* = 1$. If we assume that the second phase 
sample is a random sample of size r from the n first-phase 
sample units, then $w_i^* = nr^{-1}w_i$. Under certain conditions we 
can write the estimator in (17) as 

$$\bar y_I = \sum_{i=1}^{n} w_i\hat y_i. \qquad (18)$$

The representation (18) holds if $(w_i^*)^{-1}w_i$ is in the column 
space of the matrix $X = (x_1', \ldots, x_r')'$ because then we have 
$\sum_{i=1}^{r} w_i(y_i - \hat y_i) = 0$ from $\sum_{i=1}^{r} w_i^* x_i'(y_i - \hat y_i) = 0$. 
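
A small numerical check of the equivalence between (17) and (18) is sketched below in Python. The data and weights are made up, the regression has an intercept and one covariate z, and the second-phase weights are taken as the first-phase weights of the respondents rescaled to sum to one (an assumption made here so that the ratio of the two weights is constant and therefore lies in the column space of X through the intercept).

```python
def wls_fit(z, y, w):
    """Weighted least squares of y on (1, z); returns (intercept, slope)."""
    sw = sum(w)
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    szz = sum(wi * (zi - zbar) ** 2 for wi, zi in zip(w, z))
    szy = sum(wi * (zi - zbar) * (yi - ybar) for wi, zi, yi in zip(w, z, y))
    slope = szy / szz
    return ybar - slope * zbar, slope


# Hypothetical sample: n = 6 units, the first r = 4 respond.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.10, 0.20, 0.15, 0.25, 0.10, 0.20]   # first-phase weights, sum to 1
y_obs = [2.0, 2.9, 4.2, 4.8]
r = len(y_obs)

# Assumed second-phase weights: proportional to w among respondents.
total_resp = sum(w[:r])
w_star = [wi / total_resp for wi in w[:r]]

b0, b1 = wls_fit(z[:r], y_obs, w_star)
yhat = [b0 + b1 * zi for zi in z]

# Equation (17): observed values for respondents, fitted values otherwise.
est_17 = (sum(wi * yi for wi, yi in zip(w, y_obs))
          + sum(wi * fi for wi, fi in zip(w[r:], yhat[r:])))
# Equation (18): fitted values for every unit.
est_18 = sum(wi * fi for wi, fi in zip(w, yhat))
print(est_17, est_18)
```

The two estimates agree up to rounding because the weighted residuals of the respondents sum to zero under the normal equations.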

We assume a sequence of samples and finite populations
such as that described in Fuller (1998). Define
$\bar x_n = \sum_{i=1}^n w_i x_i$ and $(\bar x_r, \bar y_r) = \sum_{i=1}^r w_i^* (x_i, y_i)$. We also
adopt the same assumptions as in Fuller (1998). That is


E(,, %,)>) fo (u,., H,, M,), (19) 


and 


V{(B - BY’, X,,,,5,} = OM), (20) 


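where $(\mu_x, \mu_y) = N^{-1} \sum_{i=1}^N (x_i, y_i)$ and
$\beta = (\sum_{i=1}^N x_i' x_i)^{-1} \sum_{i=1}^N x_i' y_i$.
For $i = 1, 2, \ldots, N$, define

$$a_i = \begin{cases} 1 & \text{if unit } i \text{ responds when sampled} \\ 0 & \text{otherwise,} \end{cases}$$

and $a = (a_1, a_2, \ldots, a_N)$. The extended definition of $a_i$ is
discussed by Fay (1991) and used in Shao and Steel (1999).
Now, let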


$$\tilde y_I = \sum_{i=1}^n w_i \tilde y_i^*, \qquad (21)$$

where

$$\tilde y_i^* = \tilde y_i + a_i w_i^{-1} w_i^* (y_i - \tilde y_i) \qquad (22)$$

with $\tilde y_i = x_i \beta$. Then, we have
$\bar y_I = \tilde y_I + (\bar x_n - \bar x_r)(\hat\beta - \beta)$.
By (19) and (20), we have $\bar y_I = \tilde y_I + O_p(n^{-1})$ and
$V(\bar y_I - \bar Y_N) = V(\tilde y_I - \bar Y_N) + o(n^{-1})$. Now,

$$V(\tilde y_I - \bar Y_N) = V[E(\tilde y_I - \bar Y_N \mid a)] + E[V(\tilde y_I - \bar Y_N \mid a)]. \qquad (23)$$

The first term on the right side of (23) is 0 because
$E(\tilde y_I - \bar Y_N \mid a) = 0$ under model (7). To estimate the second
term in (23), note that conditional on a, $\tilde y_I$ is a linear
estimator. Hence, the standard variance estimation method
applied to the pseudo data set $\tilde Y^* = \{\tilde y_i^*; i = 1, 2, \ldots, n\}$ will
unbiasedly estimate the variance of $\tilde y_I = \sum_{i=1}^n w_i \tilde y_i^*$. Since
the set $\tilde Y^*$ is not observable, we can use the set
$\hat Y^* = \{y_i^*; i = 1, 2, \ldots, n\}$, where

$$y_i^* = \hat y_i + a_i w_i^{-1} w_i^* (y_i - \hat y_i) \qquad (24)$$

to get a consistent variance estimator.
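The construction in (24) can be sketched as follows. This is a minimal illustration under stated assumptions: simple random sampling, mean imputation (so the predicted value is the respondent mean), and a toy data set. The function name and the data are hypothetical and do not come from the paper.

```python
from statistics import mean


def pseudo_data(y, y_hat, a, w, w_star):
    """Pseudo values of (24): yhat_i + a_i * (w*_i / w_i) * (y_i - yhat_i).

    y[i] may be None for nonrespondents (a[i] == 0); w_star[i] is only
    used for respondents.
    """
    out = []
    for yi, yh, ai, wi, ws in zip(y, y_hat, a, w, w_star):
        out.append(yh + (ws / wi) * (yi - yh) if ai == 1 else yh)
    return out


# Toy example: n = 6 sampled units, r = 4 respondents, mean imputation.
y = [3.0, 5.0, 4.0, 8.0, None, None]
a = [1, 1, 1, 1, 0, 0]
n, r = len(y), sum(a)
w = [1.0 / n] * n        # first-phase weights, summing to 1
w_star = [1.0 / r] * n   # second-phase weights, n/r times w_i
resp_mean = mean(v for v, ai in zip(y, a) if ai == 1)
y_hat = [resp_mean] * n

y_star = pseudo_data(y, y_hat, a, w, w_star)

# Standard simple-random-sampling variance estimator applied to pseudo data.
ybar_star = sum(wi * v for wi, v in zip(w, y_star))
v_pseudo = sum((v - ybar_star) ** 2 for v in y_star) / (n * (n - 1))

# Naive estimator: the same formula applied to the imputed data set.
imputed = [yi if ai == 1 else yh for yi, yh, ai in zip(y, y_hat, a)]
v_naive = sum((v - resp_mean) ** 2 for v in imputed) / (n * (n - 1))
```

In this toy case the pseudo data keep the imputed mean unchanged while inflating the respondent residuals by n/r, so the variance estimate is larger than the naive one.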

To illustrate that the observable pseudo data set defined
by (24) can be used to approximate the variance estimator,
assume that the full sample variance estimator of $\bar y_n$ can be
written as

$$\hat V = \sum_{i=1}^L c_i (\bar y^{(i)} - \bar y_n)^2,$$

where L is the number of replications, $c_i$ is the i-th
replication factor, and $\bar y^{(i)} = \sum_{j=1}^n w_j M_j^{(i)} y_j$ is the i-th
replicate of $\bar y_n$. The term $M_j^{(i)}$ is the replication multiplier
applied to the weight of unit j at the i-th replication. For
example, under simple random sampling, the jackknife
multiplier is

$$M_j^{(i)} = \begin{cases} (n-1)^{-1} n & \text{if } i \neq j \\ 0 & \text{if } i = j. \end{cases}$$

Assume that the replicate variance estimator V is applied 


to the set Y* to get 


where py,” =i, My, 
Then, we have 


with y, ” being defined in (24). 


<Q) ELC) 


yy -Y; =; ) _ =@) 


Vt Xie eon 


x, +x,) (6 - B) (25) 


where 


(5x) Ly, Mx, aw, Ww, | x))- 


It is shown in nah: C that 
L 


= Delyi” 


eu | 


~Fy) + 0,007). (26) 


Therefore, the standard jackknife variance estimator applied
to the observable pseudo data defined in (24) can be used to
approximate the standard jackknife variance estimator applied
to the unobservable pseudo data used in (21).
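A replicate variance estimator of this form can be applied to any pseudo data set without modification. The sketch below assumes simple random sampling, the delete-one jackknife multipliers described above, and the usual jackknife factor c_i = (n - 1)/n, which is an assumption of this example and is not stated in the text. Function names and data are illustrative.

```python
def jackknife_multipliers(n):
    """Delete-one multipliers: M[i][j] = n/(n-1) if j != i, else 0."""
    return [[0.0 if j == i else n / (n - 1) for j in range(n)]
            for i in range(n)]


def replicate_variance(values, w, M, c):
    """sum_i c_i (ybar^(i) - ybar)^2, ybar^(i) = sum_j w_j M[i][j] values_j."""
    ybar = sum(wj * v for wj, v in zip(w, values))
    total = 0.0
    for ci, row in zip(c, M):
        rep = sum(wj * mij * v for wj, mij, v in zip(w, row, values))
        total += ci * (rep - ybar) ** 2
    return total


# Pseudo values for a toy sample (4 respondents, 2 mean-imputed units).
y_star = [2.0, 5.0, 3.5, 9.5, 5.0, 5.0]
n = len(y_star)
v_jk = replicate_variance(
    y_star,
    [1.0 / n] * n,
    jackknife_multipliers(n),
    [(n - 1) / n] * n,
)
```

Only the data vector changes when moving from observed data to pseudo data, which is why existing replication software can be reused.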


4.2 Random Imputation 


The arguments for variance estimation with random
imputation are quite similar to those for deterministic
imputation described in the previous subsection. First,
define the imputation indicator function

$$d_{ij} = \begin{cases} 1 & \text{if unit } i \text{ is used as donor for unit } j \\ 0 & \text{otherwise.} \end{cases} \qquad (27)$$

Then, the estimator of the mean of y using random
imputation is

$$\bar y_I = \sum_{i=1}^n w_i y_i^*, \qquad (28)$$

where

$$y_i^* = \hat y_i + a_i (1 + d_i)(y_i - \hat y_i) \qquad (29)$$

and

$$d_i = \sum_{j=1}^n (1 - a_j) d_{ij} w_i^{-1} w_j. \qquad (30)$$

If the original sample weights are the same, then $d_i$ is the
number of times that unit i is used as a donor. We assume
that

$$E[a_i (1 + d_i) \mid F_n] = 1, \qquad (31)$$

where $F_n = \{(i, x_i, y_i); i = 1, 2, \ldots, n\}$. The expectation in
(31) is with respect to the joint distribution of the response
mechanism and the imputation mechanism. Then, we have


SAO ei) ins 8 


If we assume equal response probability, then, by (31), the 
probability of selection of donors should be proportional to 
the weights. This is the Rao and Shao (1992) setup for 
random imputation. 
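The quantities in (29) and (30) can be computed directly from a record of which donor was used for each nonrespondent. The following sketch uses a hypothetical five-unit sample with unequal weights and a single imputation cell, so the predicted value is the weighted respondent mean. The function names and the `donor_of` mapping are assumptions of this example.

```python
def donor_weight_counts(a, w, donor_of):
    """d_i of (30): sum of w_j / w_i over nonrespondents j that used donor i.

    donor_of maps each nonrespondent index j to its donor index i.
    """
    d = [0.0] * len(a)
    for j, i in donor_of.items():
        assert a[j] == 0 and a[i] == 1
        d[i] += w[j] / w[i]
    return d


def pseudo_data_random(y, y_hat, a, d):
    """Pseudo values of (29): yhat_i + a_i (1 + d_i)(y_i - yhat_i)."""
    return [yh + (1 + di) * (yi - yh) if ai == 1 else yh
            for yi, yh, ai, di in zip(y, y_hat, a, d)]


w = [0.10, 0.20, 0.30, 0.25, 0.15]   # sampling weights, summing to 1
a = [1, 1, 1, 0, 0]                  # response indicators
y = [2.0, 4.0, 6.0, None, None]
donor_of = {3: 0, 4: 2}              # unit 3 takes y[0], unit 4 takes y[2]

resp_w = sum(wi for wi, ai in zip(w, a) if ai == 1)
resp_mean = sum(wi * yi for wi, yi, ai in zip(w, y, a) if ai == 1) / resp_w
y_hat = [resp_mean] * len(y)

d = donor_weight_counts(a, w, donor_of)
y_star = pseudo_data_random(y, y_hat, a, d)

# Imputed estimator computed directly from the completed data set.
completed = [yi if ai == 1 else y[donor_of[j]]
             for j, (yi, ai) in enumerate(zip(y, a))]
ybar_imputed = sum(wi * v for wi, v in zip(w, completed))
ybar_pseudo = sum(wi * v for wi, v in zip(w, y_star))
```

The weighted mean of the pseudo values reproduces the imputed estimator, which is the identity expressed by (28).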


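Now, let

$$\tilde y_I = \sum_{i=1}^n w_i [\tilde y_i + a_i (1 + d_i)(y_i - \tilde y_i)], \qquad (32)$$

where $\tilde y_i = x_i \beta$. Then, we also have
$\bar y_I = \tilde y_I + (\bar x_n - \bar x_d)(\hat\beta - \beta)$,
where $\bar x_d = \sum_{i=1}^n w_i a_i (1 + d_i) x_i$. By
the assumption (31), we have $E(\bar x_n - \bar x_d \mid F_n) = 0$. Under
mild conditions, $\bar x_n - \bar x_d = O_p(n^{-1/2})$ and
$\bar y_I = \tilde y_I + O_p(n^{-1})$. Now,

$$V(\tilde y_I - \bar Y_N) = V[E(\tilde y_I - \bar Y_N \mid a, d)] + E[V(\tilde y_I - \bar Y_N \mid a, d)],$$

where $d = (d_1, d_2, \ldots, d_N)$. Conditional on a and d, the
estimator $\tilde y_I$ is a linear estimator. Hence, the pseudo data

$$y_i^* = \hat y_i + a_i (1 + d_i)(y_i - \hat y_i) \qquad (33)$$

can be used to estimate the variance of $\bar y_I$.
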
5. COMPARISONS WITH ADJUSTED
JACKKNIFE METHOD


Rao and Sitter (1995) proposed an adjusted jackknife
variance estimator for the ratio imputation problem. Under
the setup described in section 4, the ratio imputed estimator
of $\mu_y$ is

$$\hat\mu_I = \sum_{i=1}^n w_i [a_i y_i + (1 - a_i) \hat y_i]$$

with $\hat y_i = x_i \hat R$ and
$\hat R = (\sum_{i=1}^n w_i a_i x_i)^{-1} \sum_{i=1}^n w_i a_i y_i$. The Rao
and Sitter (1995) variance estimator is


ens Seal May (34) 


i=l 


where the adjusted jackknife replicate at the i-th replication 
is 


(i) © ®.@ 
A; = Dow, My, (35) 
Hat 


where 


: (36) 


with $\hat R^{(i)} = (\sum_{j=1}^n w_j M_j^{(i)} a_j x_j)^{-1} \sum_{j=1}^n w_j M_j^{(i)} a_j y_j$. The
adjusted values (36) in the Rao and Sitter (1995) method
can also be regarded as pseudo data for variance estimation.
Note that the calculation of the pseudo data (36) requires
recalculation of $\hat R^{(i)}$ for each i with $a_i = 1$.

We modify the calculation of the pseudo values $y_i^*$ in (5)
to

$$y_i^* = \begin{cases} \hat y_i & \text{if } a_i = 0 \\ \hat y_i + c_i (\bar x_n / \bar x_r)(y_i - \hat y_i) & \text{if } a_i = 1, \end{cases} \qquad (37)$$

where $\bar x_r = \sum_{i=1}^n w_i^* a_i x_i$, $\bar x_n = \sum_{i=1}^n w_i x_i$ and
$c_i = r^{-1} n$. The term $(\bar x_n / \bar x_r)$ is inserted to improve the
conditional properties of $\hat V$ given the first-phase sample.
The resulting variance estimator is approximately equivalent
to the adjusted jackknife variance estimator (34). To
see this, note that the adjusted values (35) can be written in
the form


(‘) 
i) “ “) » a, Os” 
ile te phe a po 
2 (i 
ws, ih 


where A =:B denotes that we define B to be A. Also, 


pie js ae 
define Z = X7_,w,x,,S=L;_w, ay, and 7 Yj 21; 4) X- 
Then by the first order Taylor expansion,


205 228 (20-2) § 
(50-9) 2 (90-1) 
we Te 


(38) 


(gOS ,Z [3° ie 
is 


Note that the right side of (38) is exactly equal to 
yw mM 2 roa, Wiawe 
i CS le 


j=l 
Thus, the pseudo data for variance estimation can be written 
as 


which reduces to (37). Hence, the proposed method is
exactly a first order Taylor linearization of the Rao and
Sitter method in the case of ratio imputation. Therefore, we
can expect our proposed method to have the same
asymptotic properties as the Rao and Sitter method up to the
order of $n^{-1}$.

The variance estimation method using the pseudo data 
set calculated by (37) is easy to implement because we can 
directly use existing software, which is more difficult with 
the Rao and Shao (1992) or Rao and Sitter (1995) method. 
Furthermore, if we calculate the pseudo data by (13), then 
the data set works for without-replacement hot deck impu- 
tation as well as for with-replacement hot deck imputation. 


6. A SIMULATION STUDY


The preceding theory was tested in a simulation study
using an artificial, finite population, from which repeated
samples were drawn. The population has L = 32 strata, $N_h$
clusters in stratum h, and 20 ultimate units in each cluster.
The values of the population parameters were chosen to
correspond to real populations encountered in the U.S.
National Assessment of Educational Progress Study
(Hansen and Tepping 1985) and are listed in Table 1. The
finite population units are

$$y_{hij} = y_{hi} + e_{hij},$$

where

$$y_{hi} \sim N(\mu_h, \rho \sigma_h^2)$$

and

$$e_{hij} \sim N(0, (1 - \rho) \sigma_h^2).$$

Shao, Chen and Chen (1998) also used the same population
in their simulation study. The value of the intra-cluster
correlation $\rho$ considered in the simulation is $\rho = 0.3$.
Simulations with other values of $\rho$ produced similar results and
are not listed here for brevity.


Table 1
Parameters of the Finite Population for Simulation

  h   N_h    μ_h     σ_h     |    h   N_h    μ_h     σ_h
  1    13   100.0   20.0     |    2    16    95.0   19.0
  3    20    90.0   18.0     |    4    25    98.0   19.6
  5    25    93.0   18.6     |    6    25    98.0   19.6
  7    25    96.0   19.2     |    8    28    94.0   18.8
  9    28    92.0   18.4     |   10    28    96.0   19.2
 11    31    94.0   18.8     |   12    31    92.0   18.4
 13    31    90.0   18.0     |   14    31    96.0   19.2
 15    31    94.0   18.8     |   16    31    92.0   18.4
 17    31    90.0   18.0     |   18    31    88.0   17.6
 19    31    86.0   17.2     |   20    34    84.0   16.8
 21    34    82.0   16.4     |   22    34    80.0   16.0
 23    34   [illegible]      |   24    37    85.0   17.0
 25    37    80.0   16.0     |   26    37    90.0   18.0
 27    37   [illegible]      |   28    39    80.0   16.0
 29    39    75.0   15.0     |   30    42    75.0   15.0
 31    42   [illegible]      |   32    42   [illegible]


We consider a stratified cluster sampling design, where $n_h = 2$
clusters are selected with replacement from stratum h with
equal probability and all of the ultimate units in the selected
clusters are in the sample. The sampling fraction is 6.4%.
For each sampled unit $y_{hij}$, a response indicator variable $a_{hij}$
is generated from

$$a_{hij} \sim \text{Bernoulli}(p),$$

independently of $y_{hij}$. The values of p
considered in the simulation are p = 0.9, 0.8, 0.7, 0.6, and
0.5.

A set of 5,000 samples was selected using the same
sampling design. In each of the selected samples, three
imputation methods are considered:

[M1] With-replacement weighted hot deck imputation
considered by Rao and Shao (1992), where a missing
value is imputed by a value randomly selected
from the respondents with replacement with probability
proportional to the survey weights.

[M2] Without-replacement weighted hot deck imputation,
which is the same as [M1] except that the
selection was performed using a without-replacement
sample. The without-replacement
selection of donors is carried out systematically
using the method described by Hansen, Hurwitz,
and Madow (1953, page 343) from the respondents
sorted by random order.

[M3] Overall mean imputation, where the weighted
mean of the respondents in the sample is imputed.

Hence, all the imputation methods use a single imputation
cell that collapses all the strata.
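The three schemes can be sketched as follows. This is a simplified illustration, not the simulation code: it assumes equal survey weights, and for [M2] it replaces the systematic selection of Hansen, Hurwitz, and Madow (1953) with a cycle through randomly ordered respondents, which likewise uses every donor k or k + 1 times. The function name `impute` is hypothetical.

```python
import random


def impute(y, a, method, rng):
    """Single-cell imputation with equal weights (assumption of this sketch).

    method: "M1" with-replacement hot deck, "M2" without-replacement
    hot deck, "M3" respondent-mean imputation.
    """
    donors = [yi for yi, ai in zip(y, a) if ai == 1]
    m = len(y) - len(donors)
    if method == "M1":
        fills = [rng.choice(donors) for _ in range(m)]
    elif method == "M2":
        # Cycle through randomly ordered respondents, so that each donor
        # is used k or k + 1 times when m = k * r + t.
        order = donors[:]
        rng.shuffle(order)
        fills = [order[j % len(order)] for j in range(m)]
    elif method == "M3":
        fills = [sum(donors) / len(donors)] * m
    else:
        raise ValueError(method)
    it = iter(fills)
    return [yi if ai == 1 else next(it) for yi, ai in zip(y, a)]


rng = random.Random(0)
y = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
a = [1, 1, 1, 0, 0, 0, 0, 0]   # in the study, a ~ Bernoulli(p)
out_m1 = impute(y, a, "M1", rng)
out_m2 = impute(y, a, "M2", rng)
out_m3 = impute(y, a, "M3", rng)
```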

In each imputed data set we computed three variance
estimators: the naive variance estimator, which treats the
imputed data as if they were observed data; the adjusted
jackknife variance estimator of Rao and Shao (1992) for [M1]
and [M2] and of Rao and Sitter (1995) for [M3]; and the
jackknife variance estimator based on the pseudo data. The
pseudo data set is constructed by (29) for [M1] and [M2]
and by (24) for [M3]. The complete sample variance
estimator used a standard jackknife for stratified cluster
sampling, in which a cluster is deleted for each replication.
Note that the standard jackknife is a consistent estimator of
the variance under the model with nonzero intracluster
correlation. Thus, the standard jackknife method based on
the pseudo data is applicable to the data set considered.
The point estimators of the population mean are unbiased
under the three different imputation schemes and are not
listed here.

Table 2 presents the relative bias of the three variance
estimators, the standard error of the relative bias of the
variance estimators, and the sample correlation coefficient
between Rao's adjusted jackknife variance estimator
and the new variance estimator, based on the 5,000 samples.
The relative bias of $\hat V$ as an estimator of the variance of $\bar y_I$
is calculated by $[\mathrm{Var}_B(\bar y_I)]^{-1} [E_B(\hat V) - \mathrm{Var}_B(\bar y_I)]$, where
the subscript B denotes the distribution generated by the
Monte Carlo simulation. The correlation coefficients of the
two variance estimators are computed to give a measure of
the relative linear behavior of the two variance estimators.
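The relative bias can be computed from the Monte Carlo output as in the short sketch below. It assumes that the Monte Carlo variance of the point estimator uses divisor B (the number of simulated samples); the paper does not state which divisor was used, and the function name is illustrative.

```python
from statistics import mean, pvariance


def relative_bias(point_estimates, variance_estimates):
    """[Var_B(ybar_I)]^{-1} [E_B(V_hat) - Var_B(ybar_I)] over B samples."""
    var_b = pvariance(point_estimates)
    return (mean(variance_estimates) - var_b) / var_b


# Three hypothetical replications, for illustration only.
rb = relative_bias([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```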


Table 2
Relative Bias of the Variance Estimator, Standard Error
of the Relative Bias, and Sample Correlation Coefficient
Between Rao's Variance Estimator and the
New Variance Estimator Based on 5,000 Samples

Response  Imputation       Rel. Bias x 100 (S.E. x 100)              Corr.
Rate (p)  Method       Naive           Rao             New           Coeff.
0.9       M1          -17.40 (2.02)    1.61 (2.03)     1.70 (2.04)   0.967
          M2          -17.50 (2.00)    1.41 (2.01)     0.81 (2.03)   0.974
          M3          -18.03 (2.03)    1.16 (2.05)     1.15 (2.04)   1.000
0.8       M1          -34.45 (2.01)    0.65 (2.03)     0.49 (2.05)   0.939
          M2          -32.89 (2.01)    2.49 (2.04)     0.19 (2.03)   0.947
          M3          -34.96 (2.01)    1.59 (2.03)     1.59 (2.03)   1.000
0.7       M1          -48.96 (2.01)    0.21 (1.99)     0.41 (2.04)   0.912
          M2          -44.76 (2.02)    5.31 (2.05)     0.76 (2.05)   0.920
          M3          -50.21 (2.02)    1.53 (2.05)     1.52 (2.04)   1.000
0.6       M1          -59.80 (2.02)    1.58 (2.05)     1.27 (2.06)   0.892
          M2          -54.86 (2.03)    7.10 (2.07)    -0.75 (2.07)   0.899
          M3          -64.11 (2.00)   -0.35 (2.04)    -0.35 (2.01)   1.000
0.5       M1          -69.75 (1.99)    0.84 (2.03)     1.12 (2.03)   0.873
          M2          -59.90 (2.01)   15.07 (2.07)     2.27 (2.06)   0.872
          M3          -74.44 (1.97)    1.99 (2.00)     1.98 (2.00)   1.000
Table 2 supports our theory in the following ways.


1. As is well known, the naive variance estimator seriously 
underestimates the true variance. The adjusted jackknife 
variance estimator performs well for [M1] and [M3], but 
not for [M2]. The theory for the adjusted jackknife 
method assumes that hot deck imputations are done 
using the with-replacement selection which is not used 
in [M2]. As the response rate decreases in Table 2, the 
relative bias of the adjusted jackknife becomes larger. 


2. The new method based on the pseudo data performs 
well even for the without-replacement imputation [M2]. 
As was discussed at the end of section 3, a single 
formula (29) can be used as the pseudo data for a large 
class of imputation methods. 


3. As is observed in the correlation coefficients, the 
behaviors of the adjusted jackknife variance estimator 
and the proposed variance estimator are very similar for 
mean imputation [M3]. This is because the two variance 
estimators are asymptotically equivalent, as discussed in 
section 5. 


7. CONCLUDING REMARKS 


We have described methods of making pseudo data to be
used for variance estimation. Generally speaking, the
pseudo data can be described as

$$y_i^* = \begin{cases} \hat y_i + c_i g_i (y_i - \hat y_i) & \text{if } i = 1, 2, \ldots, r \\ \hat y_i & \text{if } i = r+1, \ldots, n, \end{cases} \qquad (39)$$

where $\hat y_i$ is the predicted value of $y_i$ under the model used
for imputation. If $c_i g_i = 1$, then the variance estimator treats
the imputed values as observations. A suitable choice of
$c_i g_i > 1$ leads to a consistent variance estimator. If the
imputation method is deterministic and the respondents are
regarded as a random sample from the original sample, then
$c_i = r^{-1} n > 1$. For two-phase sampling with a complex
design, $c_i = w_i^{-1} w_i^*$, where $w_i$ is the sampling weight of
unit i for the first-phase sample and $w_i^*$ is the sampling
weight of unit i for the second-phase sample.

The $g_i$ in (39) is the adjustment made to improve the
conditional properties given the auxiliary variable x. For
ratio imputation,

$$g_i = \bar x_r^{-1} \bar x_n,$$

where $\bar x_r = \sum_{i=1}^r w_i^* x_i$ and $\bar x_n = \sum_{i=1}^n w_i x_i$. For regression
imputation with scalar x,

$$g_i = 1 + (\bar x_n - \bar x_r) \Big\{ \sum_{k=1}^r w_k^* (x_k - \bar x_r)^2 \Big\}^{-1} (x_i - \bar x_r).$$

In either case, we have

$$\sum_{i=1}^r w_i^* g_i x_i = \bar x_n.$$
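The calibration property stated for the adjustment factors can be checked numerically. The sketch below assumes a scalar auxiliary variable, equal first-phase weights, and second-phase weights that sum to one over the respondents. The function names and the data are hypothetical.

```python
def g_ratio(x_resp, w_star, xbar_n):
    """Ratio-imputation factor: xbar_n / xbar_r for every respondent."""
    xbar_r = sum(ws * x for ws, x in zip(w_star, x_resp))
    return [xbar_n / xbar_r for _ in x_resp]


def g_regression(x_resp, w_star, xbar_n):
    """Regression-imputation factor with scalar x."""
    xbar_r = sum(ws * x for ws, x in zip(w_star, x_resp))
    sxx = sum(ws * (x - xbar_r) ** 2 for ws, x in zip(w_star, x_resp))
    return [1 + (xbar_n - xbar_r) * (x - xbar_r) / sxx for x in x_resp]


x = [1.0, 2.0, 4.0, 5.0, 3.0, 6.0]   # x is observed for all n = 6 units
r = 4                                # the first r units are respondents
xbar_n = sum(x) / len(x)
x_resp = x[:r]
w_star = [1.0 / r] * r               # second-phase weights, summing to 1

cal_ratio = sum(ws * g * xi for ws, g, xi
                in zip(w_star, g_ratio(x_resp, w_star, xbar_n), x_resp))
cal_reg = sum(ws * g * xi for ws, g, xi
              in zip(w_star, g_regression(x_resp, w_star, xbar_n), x_resp))
```

Both weighted sums reproduce the full-sample mean of x, so the adjusted respondent weights are calibrated to the first-phase sample.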


While this paper was under review, Shao and Steel
(1999) also provided similar methods in the case of
deterministic imputation. Our method is more general in the
sense that we also considered random imputation and
introduced the $c_i$ term to improve finite sample properties.
APPENDIX
A. Proof of Equations (10) and (12)


The estimator fi, in (9) can be written as 


n r 

an Dye eae (A.1) 
i=1 i= 

where $d_i$ is the number of times that unit i is used as a
donor. Under the equal probability and with-replacement
imputation mechanism, we have

$$E_I(d_i) = r^{-1} m$$

and

$$\mathrm{Cov}_I(d_i, d_j) = \begin{cases} r^{-1} m (1 - r^{-1}) & \text{if } i = j \\ -r^{-2} m & \text{if } i \neq j, \end{cases}$$

where the subscript I denotes the variation due to the
imputation mechanism. It follows that $E_I(\hat\mu_I) = n^{-1} \sum_{i=1}^n \hat y_i$
and $V_I(\hat\mu_I) = n^{-2} r^{-1} m \sum_{i=1}^r \hat e_i^2$. Hence,


r 


V(,) = v|n'S sa[ntrn a] (A.2) 
et 


Now, by an argument similar to the one leading to
(2), we have

$$\mathrm{Var}\Big\{ n^{-1} \sum_{i=1}^n \hat y_i \Big\} = [n^{-1} R^2 + r^{-1} (1 - R^2)] \sigma_y^2. \qquad (A.3)$$



Since: P= y, =; =x,) B+ o,(1), we apply classical 
regression theory to get 


E real = (1-R?)o, (A.4) 
i=l 
and 
E\(n - Ye -¥,)| = R? 05. (A.5) 


Therefore, (10) is proved and the estimator in (12) is 
consistent for the variance in (10). 


B. Validity of (15) Under the Without-Replacement 
Imputation Mechanism 


We assume that m = kr + t where k and t are nonnegative 
integers and t<r. Let the estimator of the mean of y have 
the form (A.1). Let the imputation be performed such that 
t of the respondents are used k + 1 times for imputation and 
r - t units are used k times for imputation. The t
respondents that are used k + 1 times are chosen by simple
random sampling without replacement. Then,


E,(d;) = karst =am 
and 
reid =rat if i=j 
Cov (d;, d;) = ‘ 
-r“t if i #7. 


So, by similar arguments as in the proof of (A.2), we have 


Via,) = VG) sa nary é; ey) 
Hence, using (A.3) and (A.4), we have 
Vif} =[n7R?+@ +n) (1-R%»J0?. (B.2) 


Now, conditional on the realized sample and the 
respondents, we have 


sire EA 


so that V{y1,} in (15) satisfies 


E,(V{u,}) sn ‘(n- veo> (Y; sshd 


+ frien ?ta -r a] 


(r-p)! » CBs 


Therefore, using (A.4) and (A.5), we have the approximate 
unbiasedness of the Vin, } under the without-replacement 
imputation mechanism. 


C. Proof of Equation (26) 


First, define Rk = x.) - x) (B - B) 
(x, - X,) (B - B). From the equality (25), 


and R,, = 


L 
¥ = do o,(57 -5,) = A, +B, +2C, 


= 


where A, ie 5c, ip = Jnlon Bago nek eR, Y. 

andeiGu= eke oe ¥,) (RY -R,). Hence, by the 
assumption (20), (26) follows because A,=O,(n ayy 
B,=0,(n"), and C, = o,(n 1), The last property comes 
from the Cauchy-Schwarz inequality, $C_n^2 \le A_n B_n$.
A Multivariate Technique for Multiply Imputing Missing Values
Using a Sequence of Regression Models


TRIVELLORE E. RAGHUNATHAN, JAMES M. LEPKOWSKI, JOHN VAN HOEWYK
and PETER SOLENBERGER


ABSTRACT 


This article describes and evaluates a procedure for imputing missing values for a relatively complex data structure when 
the data are missing at random. The imputations are obtained by fitting a sequence of regression models and drawing values 
from the corresponding predictive distributions. The types of regression models used are linear, logistic, Poisson, 
generalized logit or a mixture of these depending on the type of variable being imputed. Two additional common features 
in the imputation process are incorporated: restriction to a relevant subpopulation for some variables and logical bounds 
or constraints for the imputed values. The restrictions involve subsetting the sample individuals that satisfy certain criteria 
while fitting the regression models. The bounds involve drawing values from a truncated predictive distribution. The 
development of this method was partly motivated by the analysis of two data sets which are used as illustrations. The 
sequential regression procedure is applied to perform multiple imputation analysis for the two applied problems. The 
sampling properties of inferences from multiply imputed data sets created using the sequential regression method are 
evaluated through simulated data sets. 


KEY WORDS: Item nonresponse; Missing at random; Multiple imputation; Nonignorable missing mechanism;
Regression; Sampling properties and simulations.


1. INTRODUCTION 


Incomplete data is a pervasive problem faced by most 
applied researchers. Several methods have been, and 
continue to be, developed to draw inferences from data sets 
with missing values (Little and Rubin 1987). The multiple 
imputation framework suggested by Rubin (1978, 1987a, 
1996) is an attractive option if a data set is to be used by 
multiple researchers with differing levels of statistical 
expertise. This approach involves imputing several 
plausible sets of missing values in the incomplete data set 
resulting in several completed data sets. Each completed 
data set is analyzed separately, say by fitting a particular 
regression model. The resulting inferences — point estimates 
and the covariance matrices — are then combined using the 
formula given in Rubin (1987a, Chap. 3) and refinements 
thereof (Li, Raghunathan and Rubin 1991; Li, Meng, 
Raghunathan and Rubin 1991; Meng and Rubin 1992; and 
Barnard 1995). 
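For a scalar estimand, the basic combining rule can be sketched as below. This is the standard form of Rubin's rule for M completed-data estimates and their variances; the refinements cited above, and the matrix version for vector estimands, are not covered. The function name is illustrative.

```python
from statistics import mean, variance


def combine(estimates, variances):
    """Combine M completed-data estimates and variances (scalar case)."""
    M = len(estimates)
    q_bar = mean(estimates)        # combined point estimate
    w_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance, divisor M - 1
    t = w_bar + (1 + 1 / M) * b    # total variance
    return q_bar, t


# Hypothetical results from M = 3 completed data sets.
q_bar, total_var = combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```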

Imputation based approaches for handling missing data, 
in general, are quite useful in practice because once the 
missing values have been imputed, existing complete-data 
software can be used to analyze the data. Since software 
development for complete data analysis is keeping pace 
with the introduction of new statistical methods, applied 
researchers without knowledge of particular missing data 
techniques or resources to generate their own code for 
implementing new missing data procedures will be able to 
fit finely tuned substantive models for a specific problem at
hand. An added advantage of the multiple imputation
approach is that by repeatedly applying the complete data 
software, one can obtain valid point and interval estimates 
under a fairly general set of conditions (Rubin 1987a). 
Several researchers (see, for example, the list of references 
in Rubin 1996) have applied this technique under a variety 
of settings and have demonstrated, through analysis of 
simulated and actual data sets, the appropriateness of this 
approach. Alternatives such as single imputation with an 
appropriate variance estimation procedure, for example, 
modified Jackknife Repeated Replication Technique (Rao 
and Shao 1992) also have this advantage. The imputation 
approach described in this paper can also be used to create 
single imputation with an alternative variance estimation 
procedure. 

The development of imputation methods from varying 
perspectives has a long history (Madow, Nisselson, Olkin 
and Rubin 1983). A theoretically appealing framework for 
developing imputation methods is the Bayesian approach. 
This approach specifies an explicit model for variables with 
missing values, conditional on the fully observed variables 
and some unknown parameters, a prior distribution for the 
unknown parameters, and a model for the missing data 
mechanism, which does not need to be specified under an 
ignorable missing data mechanism (Rubin 1976). This 
explicit model then generates a posterior predictive distri- 
bution of the missing values conditional on the observed 
values. The imputations are draws from this posterior pre- 
dictive distribution. Several computer programs and 
algorithms are available for imputing missing values under
multivariate normality (Rubin and Schafer 1990), the 
multivariate t distribution (Liu 1995), and several variations 
of the general location model (Schafer 1997; Raghunathan 
and Grizzle 1995; and Raghunathan and Siscovick 1996). 
The latter model can handle the joint distribution of 
categorical and continuous variables and was first proposed 
by Olkin and Tate (1961), and used by Little and Schluchter 
(1985) explicitly for missing data problems. An important 
property of these approaches is that they are fully condi- 
tional on all the observed information. Several simulation 
studies (for example, Raghunathan and Grizzle 1995) indi- 
cate that the inferences drawn from such imputed data have 
desirable sampling properties. 

Survey data sets often consist of large numbers of 
variables which have a variety of distributional forms. 
Typically, such data sets have hundreds of variables, some 
continuous, others counts, many dichotomous or poly- 
tomous, and even some semi-continuous or limited 
dependent variables. Moreover, the distributions of the 
continuous variables alone may involve normal, lognormal, 
and other distributions. Postulating a full Bayesian model 
can be very difficult in this situation. Furthermore, survey 
data commonly have two additional features that make the 
modeling process even more complex. First, certain 
restrictions are imperative. For example, the variable 
“Number of Years Since Quit Smoking” is defined only for 
former smokers; hence, the imputation process for this 
variable should be restricted only to former smokers. 
Restrictions also arise due to skip patterns in the question- 
naire. For example, certain questions about income from a 
second job are asked only when the respondent indicates 
that he/she has a second job. The imputation of such 
variables has to be handled in a hierarchical manner. 

Second, there are certain logical or consistency bounds 
for the missing values that must be incorporated in the 
imputation process. Such interrelationships among the 
variables make the model specification difficult. For 
instance, “Years of Smoking” is restricted to current or past 
smokers and the imputed values must be less than Age — x 
years, where x may be chosen based on certain other 
characteristics, such as evidence of smoking as a teen-ager. 
For a former smoker, x also includes years since smoking 
ceased. Another example of bounds is discussed in 
Heeringa, Little and Raghunathan (1997). They address 
imputation of bracketed response questions in which a 
respondent is unable or unwilling to provide an exact 
response (e.g., income and assets), but does define the
bounds within which the imputed values must lie. 

The goal of this paper is to propose and evaluate a 
general purpose multivariate imputation procedure that can 
handle a relatively complex data structure where explicit 
full multivariate models cannot be easily formulated but the 
imputed values for each individual are fully conditional on 
all the values observed for that individual. The approach is 
to consider imputation on a variable by variable basis but to
condition on all observed variables. The basic strategy
creates imputations through a sequence of multiple 
regressions, varying the type of regression model by the 
type of variable being imputed. Covariates include all other 
variables observed or imputed for that individual. The 
imputations are defined as draws from the posterior 
predictive distribution specified by the regression model 
with a flat or non-informative prior distribution for the 
parameters in the regression model. The sequence of 
imputing missing values can be continued in a cyclical 
manner, each time overwriting previously drawn values, 
building interdependence among imputed values and 
exploiting the correlational structure among covariates. To 
generate multiple imputations, the same procedure can be
applied with different random starting seeds or taking every
P-th imputed set of values in the cycles mentioned above.

The variables in the data set are assumed to be of the 
following five types: (1) continuous, (2) binary, (3) catego- 
rical (polytomous with more than two categories), (4) 
counts and (5) mixed (a continuous variable with a non-zero 
probability mass at zero). Computationally, binary and 
categorical variables can be treated identically, but distin- 
guishing them helps in conceptual understanding and in the 
description of the basic algorithm. We also assume that the 
population is essentially infinite, the sample is a simple 
random sample and the missing data mechanism is 
ignorable (Rubin 1976). The use of multiple imputation in 
a complex design setting has, as yet, not been fully 
investigated and is beyond the scope of the current paper. 

In this paper we describe the sequential regression
multivariate imputation (SRMI) approach in section 2 and
evaluate two applications of the approach in sections 3 and
4. In the first application, it is difficult to postulate a joint
multivariate distribution because of the complex systematic
relationship between the variables and restrictions. In the
second application, a general location model can be used to
create multiple imputations (Olkin and Tate 1961; and
Little and Schluchter 1985). Hence, we compare multiple
imputation inferences resulting from the SRMI approach to
those resulting from a joint multivariate model. The results
of a simulation study investigating the sampling properties
of imputed data inferences are presented in section 5, and
a concluding discussion with directions for future research
is given in section 6.


2. IMPUTATION METHOD 


For a sample of size n, let X denote an n x p design or
predictor matrix containing all the variables with no missing
values. X consists of continuous, binary, count or mixed
variables, and appropriate dummy variables representing
categorical variables. In addition, X may also consist of a
column of ones to model an intercept parameter, offset
variables, and certain design variables. Let $Y_1, Y_2, \ldots, Y_k$
denote k variables with missing values, ordered, without
loss of generality, by the amount of missing values, from
least to most. The pattern need not be monotone. (In a
monotone pattern of missing data, $Y_2$ is observed only for
a subset of subjects on whom $Y_1$ is observed, $Y_3$ is
observed only for a subset of those on whom $Y_2$ is observed
and so on.)

For model based imputations, the joint conditional
density of $Y_1, Y_2, \ldots, Y_k$ given X can be factored as

$$f(Y_1, Y_2, \ldots, Y_k \mid X, \theta_1, \theta_2, \ldots, \theta_k) = f_1(Y_1 \mid X, \theta_1) f_2(Y_2 \mid X, Y_1, \theta_2) \cdots f_k(Y_k \mid X, Y_1, Y_2, \ldots, Y_{k-1}, \theta_k), \qquad (1)$$

where $f_j$, $j = 1, 2, \ldots, k$ are the conditional density functions
and $\theta_j$ is a vector of parameters in the conditional
distribution (e.g., regression coefficients and dispersion
parameters). In the sample survey context this can be viewed as
a superpopulation model. We model each conditional
density through an appropriate regression model with
unknown parameters, $\theta_j$, and draw from the corresponding
predictive distribution of the missing values given the
observed values. We assume that the prior distribution for
the parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ is $\pi(\theta) \propto 1$ (diffuse relative
to the likelihood). However, the method can easily be
modified for specified proper prior distributions.

Each conditional regression is based on one of the
following models:

1. A normal linear regression model on a suitable scale
(for example, a Box-Cox power transformation may be
used to achieve normality) if $Y_j$ is continuous;

2. A logistic regression model if $Y_j$ is binary;

3. A polytomous or generalized logit regression model if $Y_j$
is categorical;

4. A Poisson loglinear model if $Y_j$ is a count variable; and

5. A two-stage model where zero-non zero status is
imputed using logistic regression, and conditional on
non-zero status, a normal linear regression model is
used to impute non-zero values, if $Y_j$ is mixed.


Each imputation consists of c "rounds". Start round 1 by
regressing the variable with the fewest number of missing
values, $Y_1$, on X, imputing the missing values under the
appropriate regression model. Assuming a flat prior for the
regression coefficients, the imputations for the missing
values in $Y_1$ are draws from the corresponding posterior
predictive distribution. (See Appendix A for a detailed
discussion about drawing values for various regression
models.) Then update X by appending $Y_1$ appropriately
(for example, dummy variables, if it is categorical) and
move on to the next variable, $Y_2$, with the next fewest
missing values. Repeat the imputation process using
updated X as predictors until all the variables have been
imputed. That is, $Y_1$ is regressed on $U = X$; $Y_2$ is regressed
on $U = (X, Y_1)$ where $Y_1$ has imputed values; $Y_3$ is
regressed on $U = (X, Y_1, Y_2)$ where $Y_1$ and $Y_2$ have imputed
values; and so on.

The imputation process is then repeated in rounds 2
through c, modifying the predictor set to include all Y
variables except the one used as the dependent variable.
Thus, regress $Y_1$ on X and $Y_2, Y_3, \ldots, Y_k$; regress $Y_2$ on X and
$Y_1, Y_3, \ldots, Y_k$; and so on. Repeated cycles continue for a
prespecified number of rounds, or until stable imputed
values occur.
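The round structure described above can be sketched for the simplest case, in which every variable with missing values is continuous and is imputed with a normal linear regression under a flat prior. This is an illustrative sketch and not the authors' software: the function names are hypothetical, restrictions and bounds are ignored, and the draws use the standard result that the residual variance is distributed as RSS divided by a chi-square variate and the coefficients are normal around the least squares estimate.

```python
import math
import random


def cholesky(A):
    """Lower-triangular L with A = L L' for a positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def solve_upper_t(L, b):
    """Solve L' x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def draw_imputations(U_obs, y_obs, U_mis, rng):
    """Posterior predictive draws for a normal linear model, flat prior."""
    q, n = len(U_obs[0]), len(y_obs)
    A = [[sum(u[s] * u[t] for u in U_obs) for t in range(q)] for s in range(q)]
    c = [sum(u[s] * yi for u, yi in zip(U_obs, y_obs)) for s in range(q)]
    L = cholesky(A)
    beta_hat = solve_upper_t(L, solve_lower(L, c))
    rss = sum((yi - sum(b * x for b, x in zip(beta_hat, u))) ** 2
              for u, yi in zip(U_obs, y_obs))
    # sigma^2 = RSS / chi-square(n - q); chi-square(v) is Gamma(v/2, scale 2)
    sigma = math.sqrt(rss / rng.gammavariate((n - q) / 2, 2))
    # beta_hat + sigma (L')^{-1} z has covariance sigma^2 (U'U)^{-1}
    z = [rng.gauss(0, 1) for _ in range(q)]
    beta = [bh + sigma * d for bh, d in zip(beta_hat, solve_upper_t(L, z))]
    return [sum(b * x for b, x in zip(beta, u)) + sigma * rng.gauss(0, 1)
            for u in U_mis]


def srmi(X, Y, rounds, rng):
    """X: n rows of complete predictors (include 1.0 for the intercept).
    Y: k columns, each a list of n floats with None marking missing values."""
    n, k = len(X), len(Y)
    missing = [[v is None for v in col] for col in Y]
    order = sorted(range(k), key=lambda j: sum(missing[j]))  # fewest first
    filled = [list(col) for col in Y]
    done = []  # variables already imputed within round 1
    for rnd in range(rounds):
        for j in order:
            others = done if rnd == 0 else [h for h in order if h != j]
            U = [X[i] + [filled[h][i] for h in others] for i in range(n)]
            obs = [i for i in range(n) if not missing[j][i]]
            mis = [i for i in range(n) if missing[j][i]]
            if mis:
                draws = draw_imputations([U[i] for i in obs],
                                         [Y[j][i] for i in obs],
                                         [U[i] for i in mis], rng)
                for i, dval in zip(mis, draws):
                    filled[j][i] = dval  # overwrite previous draws
            if rnd == 0:
                done = done + [j]
    return filled
```

Calling `srmi` repeatedly with different seeds gives one way to generate multiple imputations, as described in the introduction. Other variable types would replace `draw_imputations` with the corresponding logistic, generalized logit, Poisson, or two-stage draw.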

The procedure outlined above needs modification to 
incorporate restrictions and bounds. The restrictions are 
handled by fitting the models to an appropriate subset of 
individuals. For example, a Poisson regression model could 
be applied to impute any missing values for the variable 
“Number of Pregnancies.” The imputation will be restricted 
to women in the sample. As a covariate, though, this 
variable may be treated differently when imputing subsequent
variables. For instance, certain dummy variables may
be created based on this variable, which are then appended
to the matrix U before proceeding with the imputation of
the next variable.

Consider another example, “Years Smoking Cigarettes,” 
where the sample would be restricted to current or past 
smokers. If there is no evidence of smoking as a teenager, 
“Years Smoking Cigarettes” for a current smoker should 
satisfy the bound (0, Age - 18). If there is some indication 
of smoking as a teenager then the range may be restricted 
to, say (0, Age - 12). For a past smoker these ranges will be 
(0, Age - 18 - YRSQUIT) and (0, Age - 12 - YRSQUIT) 
respectively, where YRSQUIT is the years since the indivi- 
dual quit smoking. The appropriate regression model for 
this variable is a truncated version of the normal linear 
regression model (possibly on a transformed scale). The 
parameters, the regression coefficients and the residual 
variance need to be drawn from the corresponding posterior 
distributions. The imputations are then drawn from the 
corresponding truncated normal distribution conditional on 
the drawn value of the parameters. 
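
A minimal sketch of the last step, drawing a bounded imputation once the regression parameters have been drawn, is given below. The function names and the inverse-CDF sampling method are illustrative assumptions and are not taken from the SRMI software; the bound helper simply encodes the Age - 18 and Age - 12 rules stated above, on the original (untransformed) scale.

```python
import random
from statistics import NormalDist

def years_smoked_upper_bound(age, smoked_as_teen, years_since_quit=0.0):
    """Upper bound for Years Smoked: Age - 18, or Age - 12 if there is
    evidence of teenage smoking, less the years since quitting (past smokers)."""
    start_age = 12 if smoked_as_teen else 18
    return max(age - start_age - years_since_quit, 0.0)

def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Draw from Normal(mean, sd^2) restricted to [lower, upper] by inverting the CDF."""
    std = NormalDist()
    p_lo = std.cdf((lower - mean) / sd)
    p_hi = std.cdf((upper - mean) / sd)
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # inv_cdf is defined only on the open interval (0, 1)
    x = mean + sd * std.inv_cdf(u)
    return min(max(x, lower), upper)      # guard against rounding just outside the bounds

# Hypothetical past smoker: age 50, no teenage smoking, quit 10 years ago.
rng = random.Random(0)
upper = years_smoked_upper_bound(50, smoked_as_teen=False, years_since_quit=10)
imputed = draw_truncated_normal(mean=30.0, sd=10.0, lower=0.0, upper=upper, rng=rng)
```

Here `mean` would be the predicted value for the subject under the drawn regression coefficients and `sd` the square root of the drawn residual variance.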

It is difficult to draw values of parameters directly from
their posterior distribution with truncated normal likelihoods.
However, the posterior density can be easily computed for a given
parameter value. The Sampling-Importance-Resampling
(SIR) algorithm (Rubin 1987b, Raghunathan and Rubin 
1988) can be used to draw from the actual posterior 
distribution. First, draw several trial parameter values from 
the posterior distribution without applying the bounds 
(untruncated normal linear regression model). Second, 
attach an importance ratio to each trial value, defined as the 
ratio of the actual posterior density with bounds to the trial 
density (the posterior density without bounds), both 
evaluated at the drawn value. Finally, resample a single 
parameter value with probability proportional to the 
importance ratios. This method requires careful monitoring 
of the distribution of importance ratios (Gelman, Carlin, 
Stern and Rubin 1995). 
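
The three SIR steps can be sketched generically as follows. The function name and the toy densities in the usage lines are assumptions made for illustration. In the bounded regression setting, if the same prior is used with and without the bounds, the log importance ratio reduces (up to a constant) to minus the sum over observed cases of the log probability that the untruncated normal falls inside that case's bounds.

```python
import numpy as np

def sir_draw(trials, log_target, log_trial, rng):
    """Resample one row of `trials` with probability proportional to the
    importance ratio target/trial, both evaluated at each trial value.
    Densities may be unnormalized; only their ratio matters."""
    log_w = np.array([log_target(t) - log_trial(t) for t in trials])
    w = np.exp(log_w - log_w.max())        # subtract the max for numerical stability
    probs = w / w.sum()
    idx = rng.choice(len(trials), p=probs)
    return trials[idx], probs

# Toy usage: trial density N(0, 2^2), target density N(1, 1).
rng = np.random.default_rng(0)
trials = rng.normal(0.0, 2.0, size=(5000, 1))
draw, probs = sir_draw(
    trials,
    log_target=lambda t: -0.5 * (t[0] - 1.0) ** 2,
    log_trial=lambda t: -0.5 * (t[0] / 2.0) ** 2,
    rng=rng,
)
```

The returned `probs` vector is worth inspecting: if a handful of trial values carry almost all of the probability, the trial density is a poor match and more trial draws are needed, which is the monitoring issue noted above.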

The bounds can also be applied to polytomous variables.
For instance, suppose that a variable Y can take one of k
values, but the observed data suggest that the missing
value for a particular subject can be either j or l. The
contribution to the likelihood from this subject corresponds
to the conditional binomial distribution. The draws in the
multinomial step (see Appendix A) are made from the
conditional distribution for these two categories. That is, the
imputed value is j with probability $s_j = P_j / (P_j + P_l)$
and l with probability $1 - s_j$.

At the completion of the initial round of imputations, the 
first complete data set with no missing values is available. 
The factorization in Equation (1) defines a joint conditional
distribution of $Y_1, Y_2, \ldots, Y_k$ given X. If the pattern of
missing data is monotone, the imputations in the first round
are approximate draws from the joint posterior predictive 
density of the missing values given the observed values. 
Note that the draws from the logistic, polytomous, and 
count variables are from large sample approximations of the 
posterior density of the regression coefficients. It is possible 
to improve upon these approximations by using, for 
example, the SIR algorithm or another rejection algorithm 
in each subsequent round. 

When the pattern of missing data is not monotone, one 
can develop a Gibbs sampling algorithm (Geman and 
Geman 1984; Gelfand and Smith 1990) corresponding to 
Model (1). For example, conditional on the drawn values of
the parameters $\theta_1, \theta_2, \ldots, \theta_k$ and the missing values drawn in
the first round, the second round would draw values of $\theta_1$
from the appropriate conditional posterior density, which is
proportional to the first term in Equation (1). Next draw the
missing values in $Y_1$ conditional on this drawn value of the
parameter $\theta_1$, all other observed or imputed values for that
subject, and the other parameters $\theta_2, \theta_3, \ldots, \theta_k$ in the model. That
is, the missing values in $Y_j$ at round (t+1) need to be
drawn from the conditional density (2),
computed based on the joint distribution in (1), where $Y_j^{(t)}$
denotes the imputed or observed values for variable $Y_j$ at round
t. Though this is conceptually possible, it is difficult even to
compute this density in most practical settings with restrictions,
bounds, and the types of variables being considered.

Our proposal is to draw missing values in $Y_j$ at round
(t+1) from a predictive distribution corresponding to the
conditional density,

$$g_j\left(Y_j \mid Y_1^{(t+1)}, \ldots, Y_{j-1}^{(t+1)}, Y_{j+1}^{(t)}, \ldots, Y_k^{(t)}, X, \phi_j\right), \qquad (3)$$

where the conditional density $g_j$ is specified by one of the
regression models described earlier, depending on the
variable type for $Y_j$, and $\phi_j$ is the vector of unknown regression
parameters with a diffuse prior. That is, the new imputed values
for a variable are conditional on the previously imputed
values of other variables, and the newly imputed values of
variables that preceded the currently imputed variable. This
proposal may be viewed as an approximation to an actual
Gibbs sampling where the conditional density (2) is
approximated by the conditional density (3). Furthermore,
this approximation can be improved by considering the SIR
or some other rejection type algorithm if the conditional
density in (2) can be computed up to a constant.

There are some other particular cases where this approxi- 
mation is equivalent to drawing values from a posterior 
predictive distribution under a fully parametric model. For 
example, if all the variables are continuous and each condi- 
tional regression model is a normal linear regression model 
with constant variance, then the algorithm converges to a 
joint predictive distribution under a multivariate normal 
distribution with an improper prior for the mean and the 
covariance matrix. 

It is theoretically possible that a sequence of draws based 
on densities in (3) may not converge to a stationary 
distribution, because these conditional densities may not be 
compatible with any multivariate joint conditional distribution
of $Y_1, Y_2, \ldots, Y_k$ given X (Gelman and Speed 1993).
Our empirical investigations using several practical data 
sets have not identified, so far, any such anomalies. In 
several large data sets, we find the conditional densities (2) 
and (3) to be quite similar. As discussed in sections 4 and 5, 
the draws from this approach are comparable to those based 
on an explicit Bayesian model. 


3. EFFECT OF SMOKING ON PRIMARY 
CARDIAC ARREST 


In our first illustration, the SRMI approach is applied to 
a case-control study examining the relationship between 
cigarette smoking and the incidence of primary cardiac 
arrest (Siscovick, Raghunathan, King, Weinmann, 
Wicklund, Albright, Bovbjerg, Arbogast, Kushi, Cobb, 
Copass, Psaty, Retzlaff, Childs and Knopp 1995). In this 
study it is difficult to formulate an explicit model which 
captures the full complexity of the data. The case subjects 
were all King County, Washington residents who had out- 
of-hospital primary cardiac arrests between 1988 and 1994. 
The case subjects were identified through a review of 
paramedic incident reports. Control subjects were selected 
by random digit dialing from King County and matched to 
case subjects on gender and age (within seven years). To be 
eligible, subjects (case and control) were required to be 
between 25 and 74 years of age, married, and free of 
clinically-diagnosed heart disease or some other life- 
threatening conditions such as cancer, liver disease, lung 
disease, or end-stage renal disease. 

Because primary cardiac arrest has a case-fatality rate 
greater than 80%, the eligibility criterion of marriage was 
included so that information regarding risk factor exposure 
(i.e., smoker status, years smoked) could be ascertained 
from surrogate respondents (i.e., spouses). Among control
and surviving case subjects, both subject and surrogate
were interviewed to gather exposure data. The control and
the surviving case subjects were interviewed mainly to
study the reliability of measurements from their surrogates.
Among the variables considered in this paper, there were
practically no differences in the measurements obtained
from the subjects and their surrogates for control or case
subjects.

Table 1 gives the means, standard deviations, and 
percent missing values for key variables by case-control 
status. The exposure variables are indicator variables for
Former Smoker ($X_1$) and Current Smoker ($X_2$), and Years
Smoked ($X_3$). The confounding variables considered are
Age, Body Mass Index (BMI = Weight [in kg] / Height$^2$ [in
meters]), and the binary variables Female
and Education (High School Graduate). The substantive
model of interest is the logistic regression model,

$$\log\left[\frac{\Pr(C=1)}{\Pr(C=0)}\right] = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_1 X_3 + \alpha_4 X_2 X_3 + \alpha_5\,\text{Age} + \alpha_6\,\text{BMI} + \alpha_7\,\text{Female} + \alpha_8\,\text{Education},$$

where C is an indicator of cardiac arrest. Preliminary
investigations indicated that linear terms for Age and BMI
are appropriate.


Table 1
Means and Proportions (in %) for Key Variables and
Percent Missing

                        Control (n=551)             Cases (n=347)
Variable              % Missing   Mean (SD)       % Missing   Mean (SD)
Age                      0.0      58.4 (10.4)        0.0      59.4 (9.9)
BMI                      8.2      25.8 (4.1)         2.6      26.4 (4.6)
Years Smoked            16.8      24.8 (14.7)        5.4      31.7 (13.8)

                      % Missing   Proportion      % Missing   Proportion
Female                   0.0      23.2               0.0      19.9
≥ High School            0.0      76.8               0.0      61.9
Smoking Status
  Never Smoked           0.0      47.2               0.0      213
  Former Smoker          0.0      42.1               0.0      38.2
  Current Smoker         0.0      10.7               0.0      34.5


There are no missing values for the variables Age,
Female, Education, Smoking Status ($X_1$, $X_2$), and C. Thus,
for purposes of imputation, define X = (1, Age, Female,
Education, $X_1$, $X_2$, C). Log(BMI), having the fewest
missing values, was regressed first on X through a normal
linear regression model. Residual diagnostics indicated a 
log-transform improved the normality of residuals. 

Next, Years Smoked was regressed on U = (X, log(BMI)).
For this variable the sample was restricted to
current and former smokers. Moreover, imputed values for
Years Smoked were bounded by AGE-18, unless a
respondent reported that they smoked in school
(SCHSMK), in which case they were bounded by AGE-12. For
former smokers, imputed values were also bounded by how
long ago the respondent had quit smoking (YRSQUIT).
Thus, imputed values for former smokers who did not
smoke in school were bounded by AGE-18-YRSQUIT,
while imputed values for former smokers who did smoke in
school were bounded by AGE-12-YRSQUIT. Some
subjects (5%) had missing values on the two auxiliary items
(SCHSMK, YRSQUIT), which were imputed prior to
defining the upper bounds of Years Smoked. The inherent
structure of this data set makes it difficult to develop
explicitly a joint distribution of the variables with missing
values conditional on the completely observed variables.
SRMI is thus an appealing approach for handling this type
of data.

In imputing the missing values, we performed 1,000 
rounds for each of 25 different starting random seeds 
resulting in M = 25 imputations. The logistic regression 
model was fit to each imputed data set to obtain maximum 
likelihood estimates of the regression coefficients and 
asymptotic covariance matrices. 

We used the standard multiple imputation variance
formula (Rubin 1987a, Chap. 3) to compute the multiply
imputed estimate of the regression coefficients and the
covariance matrix. Briefly, suppose that $\hat{\alpha}^{(l)}$ is the estimate
of the vector of regression coefficients $\alpha$ in the logistic
model, and $V^{(l)}$ its covariance matrix, based on imputed
data set l. The multiply imputed estimate of $\alpha$ is

$$\hat{\alpha}_{MI} = \sum_{l=1}^{M} \hat{\alpha}^{(l)} / M$$

and its covariance matrix is

$$V_{MI} = \sum_{l=1}^{M} V^{(l)} / M + \frac{M+1}{M} B_M,$$

where

$$B_M = \sum_{l=1}^{M} \left(\hat{\alpha}^{(l)} - \hat{\alpha}_{MI}\right)\left(\hat{\alpha}^{(l)} - \hat{\alpha}_{MI}\right)' / (M - 1).$$
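
The standard combining rules (average of the estimates, average within-imputation covariance plus (M+1)/M times the between-imputation covariance) can be written in a few lines. The function name and the small numerical example are illustrative assumptions, not output from the study.

```python
import numpy as np

def combine_mi(estimates, covariances):
    """Combine M completed-data analyses.
    estimates:   (M, p) array of coefficient vectors, one per imputed data set.
    covariances: (M, p, p) array of their covariance matrices.
    Returns the multiply imputed estimate and its covariance matrix."""
    est = np.asarray(estimates, dtype=float)
    cov = np.asarray(covariances, dtype=float)
    M = est.shape[0]
    alpha_mi = est.mean(axis=0)              # average of the M estimates
    within = cov.mean(axis=0)                # average within-imputation covariance
    dev = est - alpha_mi
    between = dev.T @ dev / (M - 1)          # between-imputation covariance B_M
    v_mi = within + (M + 1) / M * between
    return alpha_mi, v_mi

# Hypothetical example with M = 2 imputations and p = 2 coefficients.
alpha_mi, v_mi = combine_mi(
    estimates=[[1.0, 2.0], [3.0, 4.0]],
    covariances=[np.eye(2), np.eye(2)],
)
```

Standard errors are the square roots of the diagonal of the combined covariance matrix.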


The number of imputations is larger than what is usually 
recommended. We performed 25 imputations with different 
random seeds to assess whether the Gibbs style rounds lead 
us to a region of the imputed values that is very different 
from the observed data. Graphical displays of the imputed 
and observed values indicated that none of the imputations 
in the 25,000 rounds were incompatible with the observed 
data distribution. 

Table 2, the complete-case analysis, gives the point esti- 
mates and their standard errors based on subjects with all 
variables observed. A total of 103 subjects (11.5%) had 
missing values in one or more predictors. A complete-case
analysis, which is generally valid only when the data are
missing completely at random, was performed after deleting
these 103 subjects (see Column 2, Table 2). Logistic
regression analyses with a missing data indicator as the 
dependent variable and a number of completely observed 
variables as predictors indicated that the data are not 
missing completely at random. One may expect, therefore, 
that the complete case estimates and standard errors are 
biased. 




Table 2
Point Estimates (Standard Errors) of Logistic Regression Coefficients for Model of Primary Cardiac Arrest for Complete Cases,
SRMI Methods 1* and 2**

                                Complete Case      SRMI Method 1      SRMI Method 2
                                (n=795)            (n=898)            (n=898)
Predictor Variables             Estimate (SE)      Estimate (SE)      Estimate (SE)
Intercept                       -2.922 (0.791)     -2.610 (0.757)     -2.348 (0.627)
Age                              0.015 (0.009)      0.015 (0.009)      0.014 (0.008)
Female                          -0.007 (0.203)     -0.115 (0.189)     -0.119 (0.177)
Education                       -0.448 (0.173)     -0.467 (0.166)     -0.444 (0.133)
BMI                              0.056 (0.018)      0.049 (0.013)      0.055 (0.009)
Current Smoker                   1.693 (0.569)      2.001 (0.543)      1.998 (0.448)
Former Smoker                    0.003 (0.284)     -0.029 (0.262)     -0.011 (0.223)
Current Smoker x Yrs Smoked     -0.003 (0.015)     -0.008 (0.013)     -0.005 (0.011)
Former Smoker x Yrs Smoked       0.019 (0.009)      0.014 (0.009)      0.014 (0.009)

* Method 1: Imputation restricted to model variables
** Method 2: Imputation includes model and auxiliary variables


Table 2, SRMI Method 1, gives estimates and their 
standard errors for SRMI using only the variables in the 
substantive model. These estimates are quite similar to the 
complete-case analysis estimates. The multiple imputation 
standard errors are smaller due to additional subjects with 
imputed data. There are modest changes in the relationship 
between smoking and primary cardiac arrest. The complete- 
case analysis indicates a statistically significant relationship 
between years smoked and primary cardiac arrest for former 
smokers, while no such association is indicated in the 
analysis of multiply imputed data. 

One of the advantages of the multiple imputation
approach is that the imputation process can use additional
variables not in the substantive analysis. Such situations 
arise when a common research database with many 
variables is used by different researchers, each using a 
subset of the variables. The imputation may be carried out 
for the entire database, where prediction for missing values 
in each variable borrows strength from all other variables in 
the data set. Such imputations have been shown to improve 
efficiency compared to those based only on variables in the 
particular substantive model (Raghunathan and Siscovick 
1996). 

Table 2, SRMI Method 2, provides multiple imputation 
estimates and their standard errors obtained when the entire 
data set was imputed using 50 additional variables. These 
included dietary indicators, physiological measures, socio- 
economic status, and behavioural variables. The point 
estimates are modestly different for all the variables. The 
standard errors, though, are considerably smaller when 
compared to the multiple imputation approach using only 
variables in the substantive model (SRMI, Method 1). This 
is not surprising because many of the additional variables 
such as blood pressure, cholesterol counts, alcohol con- 
sumption, and physical activity were highly predictive of 
BMI and smoking related variables. 


4. PARENTAL PSYCHOLOGICAL DISORDERS 
AND CHILD DEVELOPMENT 


A second illustration examines the effects of parental 
psychological disorders on several measures of childhood 
development. Little and Schuchter (1985) analyzed the data 
using a general location model to obtain maximum likeli- 
hood estimates of the parameters of the joint distribution. 
This general location model was employed to create 
multiple imputations using Markov Chain Monte Carlo 
methods (Schafer 1997), producing fully Bayesian model- 
based multiply imputed data sets. We also created multiple 
imputations using the SRMI procedure. 

The study data consists of 69 families with two children 
each. Each family was classified into one of the three risk 
categories: (1) Normal Risk — no parental psychiatric 
disorders; (2) Moderate Risk — one parent diagnosed with 
a psychiatric illness or a chronic physical illness; and (3) 
High Risk — one parent diagnosed with schizophrenia or an 
affective mental disorder. There are three primary dependent
variables of interest: $Y_{1c}$, number of psychiatric
symptoms (dichotomized as high/low) for child c; $Y_{2c}$, the
standardized reading score for child c; and $Y_{3c}$, the
standardized verbal comprehension score for child c.

We consider three models in investigating the impact of
parental psychological disorders on childhood development.
The first is a mixed effects logistic regression model:

$$\text{logit}[\Pr(Y_{1ic} = 1)] = \beta_0 + \beta_1 U_{1i} + \beta_2 U_{2i} + \gamma_i,$$

where $Y_{1ic} = 1$ if child c in family i is classified as having a
high number of symptoms and 0 otherwise; $U_{1i} = 1$ if
family i is classified as a moderate risk group and 0
otherwise; $U_{2i} = 1$ if family i is classified as a high risk
group and 0 otherwise; and $\gamma_i$ are random effects assumed
to be identically and independently distributed normal
random variables with mean 0 and variance $\sigma_\gamma^2$. This
random effect accounts for intraclass correlation between
the two children within the same family. With complete
data, this model may be fit by maximizing the numerically
integrated likelihood function of $(\beta_0, \beta_1, \beta_2, \sigma_\gamma^2)$ using the
Newton-Raphson algorithm and the Gaussian quadrature
method for the numerical integration of the likelihood
function. These types of models can be easily fit with
complete data, but are difficult to fit with missing data.

The second and third regression models relate the child’s
reading and verbal scores, respectively, to risk group after
adjusting for the number of symptoms ($Y_1$). An investigation
of the residuals after a few preliminary rounds of
reading and verbal score imputations indicated a log scale
was appropriate. Thus, denoting $Y_{2ic}$ and $Y_{3ic}$ as the
logarithm of the reading and verbal scores, respectively, for
child c in family i, we posited the following mixed effects
regression model,

$$Y_{jic} = \alpha_0 + \alpha_1 U_{1i} + \alpha_2 U_{2i} + \alpha_3 Y_{1ic} + \delta_i + \epsilon_{ic}, \qquad j = 2, 3,$$

where $\delta_i$ and $\epsilon_{ic}$ are mutually independent normal random
variables with mean 0 and variances $\sigma_\delta^2$ and $\sigma_\epsilon^2$,
respectively. Again, with no missing data in the covariates, the
maximum likelihood estimates of the unknown parameters
can be readily obtained using, for example, the PROC
MIXED procedure in SAS.

There were no missing values in the classification of the
risk groups, and thus we defined X = (1, $U_1$, $U_2$). The
missing values in the variables $Y_{21}$, $Y_{22}$, $Y_{31}$ and $Y_{32}$ were
imputed using normal linear regression, and the missing
values in $Y_{11}$ and $Y_{12}$ were imputed using logistic
regression. We created M = 25 SRMIs, repeating the process
through 1,000 rounds and 25 different seeds. The SRMI
multiply imputed data sets were analyzed and combined
using the methods described earlier. To compare these
results with the multiply imputed inferences when the
imputations are draws from the posterior predictive distribution
under the general location model, we created 25
imputations under a fully Bayesian model using software
developed by Schafer (1997). The point estimates and
standard errors for the three models using SRMI and Bayes
multiple imputation approaches are presented in Table 3.
There are no meaningful differences between the SRMI
estimates and standard errors and those resulting from the
Bayesian imputation. Children of parents in the high risk
group are approximately 7.8 [exp(2.048)] times more likely
to have a high number of symptoms than children with
parents in the normal group under the SRMI. The 95%
confidence interval for this relative risk is (3.8, 16.0). For
the moderate risk group, the corresponding point and
interval estimates are 3.7 and (1.8, 7.8). These estimates
may be contrasted with those obtained based on the
complete-case analysis (not shown): 7.4 (2.3, 24.2) for the
high risk group, and 3.5 (1.0, 11.9) for the moderate risk
group. Though the point estimates of the
relative risks are similar, the complete-case confidence
intervals are wider because they are based only on 60% of
the observations.

Based on the estimated regression coefficients in Table
3, one can infer, after adjusting for the number of
symptoms, that children in the moderate and high risk
groups have lower reading scores, by about 11 points
[exp(4.654) - exp(4.654 - 0.110)], when compared to the normal
group. On the other hand, the complete-case analysis
estimates a score of 16 points lower for children in the 
moderate risk group than their counterparts in the normal 
group, and children in the high risk group score about 19 
points lower when compared to the normal group. 
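
The back-transformations quoted in the two preceding paragraphs can be checked with a few lines of arithmetic. The sketch below exponentiates the SRMI coefficient for the high risk group (strictly an odds ratio, since it comes from a logistic model) and the log-scale reading score coefficients from Table 3. The interval here uses a normal reference value of 1.96, which is an assumption made for simplicity; the reported interval of (3.8, 16.0) is slightly wider, presumably because it is based on a t reference distribution.

```python
import math

def exp_effect_with_interval(coef, se, z=1.96):
    """Exponentiate a logistic regression coefficient and its interval limits."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

# High risk group, Symptoms model, SRMI column of Table 3.
ratio, lower, upper = exp_effect_with_interval(2.048, 0.356)

# Reading score difference on the original scale, from the log-scale model:
# intercept 4.654 and moderate risk coefficient -0.110.
reading_gap = math.exp(4.654) - math.exp(4.654 - 0.110)

print(round(ratio, 2), round(lower, 2), round(upper, 2), round(reading_gap, 1))
```

The point estimate rounds to the 7.8 quoted in the text, and the reading gap to the "about 11 points" quoted above.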

The SRMI analysis of verbal scores suggests that the 
children in the moderate and high risk groups score about 
20 and 24 points lower, respectively, than their counterparts 
in the normal group. However, the complete-case analysis 
shows the moderate risk group scores lower by 36 points 
and the high risk group scores lower by about 39 points 
when compared to the normal group. Thus, the complete- 
case estimates of the effects of parental psychological 
disorders on the child’s reading and verbal scores are quite 
different than those obtained by the analysis of the multiply 
imputed data. This is not surprising because the data on
reading and verbal scores are not missing completely at
random and are related to the risk group as well as the
number of symptoms of the child.


Table 3
Point Estimates (Standard Errors) of Regression Coefficients for Three Models of Child Development Under
SRMI and Bayesian Imputation

                                                   Dependent Variable
Predictor Variables    Imp. Method   Symptoms              Reading Score     Verbal Score
Intercept              SRMI          -0.678 (0.256)         4.654 (0.013)     4.873 (0.020)
                       Bayes         -0.688 (0.257)         4.556 (0.013)     4.991 (0.021)
High Risk Group        SRMI           2.048 (0.356)        -0.109 (0.022)    -0.191 (0.032)
                       Bayes          [illegible] (0.350)  -0.108 (0.021)    -0.180 (0.033)
Moderate Risk Group    SRMI           1.289 (0.366)        -0.110 (0.022)    -0.162 (0.033)
                       Bayes          1.300 (0.360)        -0.109 (0.023)    -0.167 (0.035)
Symptoms               SRMI           -                     0.032 (0.022)    -0.083 (0.032)
                       Bayes          -                     0.031 (0.019)    -0.080 (0.030)


5. SIMULATION STUDY 


The analyses described in sections 3 and 4 indicate that 
sensible results can be obtained by applying the SRMI 
approach to handling missing values. Nevertheless, it is 
difficult to conclude based on such case studies whether or 
not the approach will result in valid inferences in routine 
applications. A simulation study was designed to investi- 
gate the repeated sampling properties of inferences from 
imputed data sets created with the SRMI approach. 
Complete data sets were generated from hypothetical popu- 
lations, and elements deleted under an ignorable missing 
data mechanism. The deleted values were imputed and 
differences in summary statistics based on the imputed data 
sets and the before deletion or full data sets were assessed. 

More formally, the strategy: 


(1) generated a complete data set which did not agree 
perfectly with our multiple imputation strategy, 

(2) estimated selected regression parameters, 

(3) deleted certain values using an ignorable missing data 
mechanism, 

(4) used SRMI to multiply impute the missing values, and 

(5) obtained multiply imputed estimates for the regression 
parameters estimated in step 2. 


The differences in the parameter estimates are examined across
several independent replications of this strategy.

A total of 2,500 complete data sets with three variables
$(U, Y_1, Y_2)$ and sample size 100 were generated using the
following models:


1. $U \sim$ Normal(0, 1);


2. $Y_1 \sim$ Gamma with mean $\mu_1 = \exp(U - 1)$ and variance
$\mu_1^2/5$; and


3. $Y_2 \sim$ Gamma with mean $\mu_2 = \exp(-1 + 0.5U + 0.5Y_1)$
and variance $\mu_2^2/2$.


The model for $Y_2$ in step 3 is the primary regression
model of interest with true regression coefficients
$\beta_0 = -1$, $\beta_1 = \beta_2 = 0.5$, and dispersion parameter $\sigma^2 = 0.5$. For
the complete data this model can be fitted using statistical
software packages such as GLIM or Splus.

The deletion or missing data mechanisms were as 
follows: 


(1) no missing values in U;

(2) the missing values in $Y_1$ depend on U through a logistic
function, logit[Pr($Y_1$ is missing)] = 1.5 + U; and

(3) the missing values in $Y_2$ depend on U and $Y_1$ through a
logistic function, logit[Pr($Y_2$ is missing)] = 1.5 - 0.5$Y_1$ - 0.5U.


These missing data mechanisms generated 22% missing
data in $Y_1$ and 29% missing data in $Y_2$. The complete-case
analysis would have only used 48% of the data.
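
One replicate of the data generation and deletion steps can be sketched as follows. The Gamma variables are parameterized by shape and scale so that the stated means and variances hold (shape 5 for the first, shape 2 for the second). One point is an assumption of this sketch and should be read with care: taken literally as missingness probabilities, the two logistic expressions above would delete far more than the reported 22% and 29% of values, so the sketch treats them as the probability that a value is observed, which gives missing data rates close to those reported.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_replicate(n=100, seed=0):
    """Generate (U, Y1, Y2) and delete values under an ignorable mechanism."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)
    mu1 = np.exp(u - 1.0)
    y1 = rng.gamma(shape=5.0, scale=mu1 / 5.0)       # mean mu1, variance mu1^2 / 5
    mu2 = np.exp(-1.0 + 0.5 * u + 0.5 * y1)
    y2 = rng.gamma(shape=2.0, scale=mu2 / 2.0)       # mean mu2, variance mu2^2 / 2
    # ASSUMPTION: the logistic expressions give the probability of being observed.
    obs1 = rng.random(n) < expit(1.5 + u)
    obs2 = rng.random(n) < expit(1.5 - 0.5 * y1 - 0.5 * u)
    y1_deleted = np.where(obs1, y1, np.nan)
    y2_deleted = np.where(obs2, y2, np.nan)
    return u, y1, y2, y1_deleted, y2_deleted

u, y1, y2, y1_deleted, y2_deleted = simulate_replicate(n=100, seed=1)
```

Because the deletion depends only on U, which is always observed, and on the generated values through the stated logistic functions, the full data sets remain available for comparison with the imputed ones.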

Since SRMI allows us only to fit a normal linear
regression model for continuous variables, the imputations
were carried out as follows. Suppose that $Y_1$ has fewer missing
values, and let $Z_1 = (Y_1^{\lambda_1} - 1)/\lambda_1$ be the Box-Cox
transformation of the continuous variable. In the first round of
imputations, assume that $Z_1$ has a normal distribution with mean
$a_0 + a_1 U$ and variance $\sigma_1^2$, where $\lambda_1$ was estimated using
the maximum likelihood approach, and that
$Z_2 = (Y_2^{\lambda_2} - 1)/\lambda_2$ has a normal distribution with mean
$b_0 + b_1 U + b_2 Z_1$ and variance $\sigma_2^2$, where $\lambda_2$ was estimated
using maximum likelihood. In the subsequent rounds, U
and $Z_2$ are predictors for $Z_1$, and U and $Z_1$ are predictors
for $Z_2$. The estimation of a power transformation using
maximum likelihood was automated while fitting each
regression model.
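
One common way to automate the choice of the power transformation is to maximize the Box-Cox profile log-likelihood over a grid of candidate values. The sketch below does this for a regression on a predictor matrix; the grid, the function names and the simulated check are assumptions for illustration and do not describe how the SRMI software performs the estimation.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transformation; the limit as lam -> 0 is log(y)."""
    return np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1.0) / lam

def profile_loglik(y, U, lam):
    """Profile log-likelihood of lam for a normal linear regression of the
    transformed response on U, including the Jacobian term (lam - 1) * sum(log y)."""
    z = box_cox(y, lam)
    beta, *_ = np.linalg.lstsq(U, z, rcond=None)
    rss = np.sum((z - U @ beta) ** 2)
    n = len(y)
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

def estimate_lambda(y, U, grid=None):
    """Grid-search maximum likelihood estimate of the Box-Cox parameter."""
    grid = np.linspace(-2.0, 2.0, 81) if grid is None else np.asarray(grid)
    loglik = [profile_loglik(y, U, lam) for lam in grid]
    return float(grid[int(np.argmax(loglik))])

# Check on simulated data whose logarithm is exactly linear in u (true lam = 0).
rng = np.random.default_rng(0)
n = 2000
u = rng.normal(size=n)
U = np.column_stack([np.ones(n), u])
y = np.exp(1.0 + 1.0 * u + 0.5 * rng.normal(size=n))
lam_hat = estimate_lambda(y, U)
```

Imputed values drawn on the transformed scale are mapped back with the inverse transformation before being appended as predictors on the original scale, if the original scale is what later analyses use.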

For each of the 2,500 simulated data sets with missing
values, a total of 250 rounds with M = 5 different random starts
were created using SRMI. For each replicate, the resulting
M = 5 imputed data sets and the full data set (before deletion)
were analyzed by fitting the Gamma model for $Y_2$ using
maximum likelihood. The multiple imputation estimate was
constructed as the average of the five imputed data estimates.
To assess the differences in the point estimates we
computed the standardized difference between the SRMI
and full data estimates,

$$\Delta(\beta) = \frac{100 \times |\text{SRMI estimate} - \text{full data estimate}|}{\text{SE(SRMI estimate)}}.$$

Table 4 gives the mean and standard deviation of $\Delta(\beta)$
for the three regression coefficients $\beta_0$, $\beta_1$, and $\beta_2$ in the model.
On average, the SRMI estimates are within about 8% of a
standard error of the full data estimates. The actual coverage
and the average length
of the 95% SRMI confidence intervals were computed for
the regression coefficients using the t reference distribution
described in Rubin (1987b). For each simulated data set and
parameter, it was determined whether or not the true value
(e.g., $\beta_1 = 0.5$) is contained within the corresponding
interval. The proportions of intervals containing the true
values were computed across the 2,500 replications and are
provided in Table 4. For the full data sets, the actual
coverage for $\beta_1$, for example, was 94.9% and for SRMI it
was 95.4%. The average length of the confidence
intervals was also computed. The average length of the full
data confidence interval for $\beta_1$ was 0.91 and for SRMI the
average length was 1.22. That is, the SRMI data resulted in
well calibrated interval estimates.
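
The two summary measures used in this comparison are simple to compute once the estimates, standard errors and interval limits from each replicate are stored. The function names and the small inputs below are illustrative assumptions.

```python
import numpy as np

def standardized_difference(srmi_est, full_est, srmi_se):
    """100 * |SRMI estimate - full data estimate| / SE(SRMI estimate), per replicate."""
    srmi_est, full_est, srmi_se = map(np.asarray, (srmi_est, full_est, srmi_se))
    return 100.0 * np.abs(srmi_est - full_est) / srmi_se

def coverage(lower, upper, true_value):
    """Proportion of intervals [lower, upper] that contain the true value."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return float(np.mean((lower <= true_value) & (true_value <= upper)))

# Hypothetical inputs: one replicate for the difference, two intervals for coverage.
delta = standardized_difference([0.52], [0.50], [0.25])
cover = coverage(lower=[0.1, 0.6], upper=[0.9, 1.0], true_value=0.5)
```

Averaging the standardized differences over replicates, and multiplying the coverage proportion by 100, gives quantities on the same scale as those reported in Table 4.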

The same simulation study was also used to compare the
distributional properties of imputations from SRMI and a
fully Bayesian method. For the model assumptions used to
generate complete data, we developed a Markov Chain
Monte Carlo algorithm for drawing values from the actual
posterior predictive distribution of the missing values given
the observed values. Each step of the draw used the
Metropolis-Hastings algorithm and required considerably more
computational time than the SRMI method. Therefore, only the
first 500 simulated data sets were used in this comparison.
We computed two Kolmogorov-Smirnov (KS) statistics
from each simulated data set: one comparing the imputations
from the SRMI method and the actual hidden values,
and the other comparing the Bayesian imputations and the
actual hidden values. There were no discernible differences
in these two statistics across the 500 simulated data sets. A
scatter plot of those 500 pairs of KS statistics showed a
narrow scatter of points around a 45 degree line.
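
For reference, the two-sample KS statistic is the largest absolute difference between the two empirical distribution functions. A minimal sketch follows; the function name is an assumption, and in the comparison described above the two samples would be the imputed values and the deleted true values from one simulated data set.

```python
import numpy as np

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)| over x."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])          # the maximum is attained at a data point
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

A value near 0 indicates that the two sets of values have nearly identical empirical distributions, and a value of 1 indicates no overlap at all.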


Table 4
Means and Standard Deviations for Standardized Differences
Between SRMI Estimates and Full Data Estimates and Actual
Coverage of Nominal 95% Confidence Intervals

Regression       Std. Difference          Confidence Coverage
Coefficient      Mean     SD              SRMI      Full Data
$\beta_0$        8.2      2.0             96.1      95.4
$\beta_1$        8.8      [illegible]     95.4      94.9
$\beta_2$        8.0      [illegible]     95.5      94.7


6. DISCUSSION 


We have described and evaluated a sequential regression 
multivariate imputation procedure that can be used to 
impute missing values in a variety of complex data 
structures involving many types of variables, restrictions, 
and bounds. This procedure should be useful when the 
specification of a joint distribution of all the variables with 
missing values is difficult. A real advantage of the approach 
is its flexibility in handling each variable on a case by case 
basis. For instance, to preserve all the bivariate correlations, 
all the main effect terms must be included as regressors, and 
to preserve, say, three factor interactions all two factor 
interactions must be included as regressors in the imputation
model. Implementation of this procedure only requires
a good random number generator and fitting routines for a
variety of multiple regression models. A SAS based
application implementing this approach can be downloaded
from a web site (www.isr.umich.edu/src/smp/ive).

In certain instances, one can modify the algorithm to 
reduce it to Gibbs sampling from the joint predictive distri- 
bution of the missing values given the observed values. 
However, the SRMI procedure will be more useful where 
an explicit model is difficult to formulate. In both the illus- 
trations and the simulation, different random starts were 
used to monitor imputed values, an important aspect in 
many practical applications. This is a good practice when 
Gibbs sampling is used under an explicit Bayesian model 
(Gelman and Rubin 1992) and should be used when the
sequential regression method discussed in this paper is
used.

The simulation study described in section 5, though
limited, is favorable as far as inferences based on the SRMI
are concerned. The imputations from SRMI and the Bayes
model were comparable. The goal here, however, was to 
develop an imputation approach that is finely tuned on a 
variable by variable basis fully conditional on all the 
observed information, rather than an explicit joint multi- 
variate distribution of all the variables. Furthermore, model 
sensitivity may be reduced by using a semiparametric 
regression model for each conditional regression. The 
Bayesian interpretation of the spline smoothing models 
(Silverman 1985) can be used to draw imputed values from 
the predictive distribution. Such modifications also deserve 
further investigation. 

For some large data sets with many variables, the SRMI 
can be computationally intense. The algorithm can be modi- 
fied to apply a variable selection method for each regression 
in each round. We compared the inferences with and 
without the variable selection on several large data sets such 
as the National Health Interview Survey and the National 
Medical Expenditure Survey using several hundred 
variables. The descriptive inferences as well as inferences 
based on linear and logistic regression models were very 
similar; still, further detailed investigation is needed. 

It is also possible to use the imputation approach 
discussed in this paper in conjunction with, for example, the 
Jackknife Repeated Replication (JRR) technique for 
variance estimation. Specifically, (1) re-impute, singly, the 
missing values in each jackknife replicate using SRMI; (2) 
analyze the imputed replicate data set; and, finally, (3) 
combine the replicate estimates to obtain the point estimate 
and its covariance matrix. This approach is more compu- 
tationally intensive than the multiple imputation approach. 
This integrated JRR imputation approach and several of its 
variations are currently under investigation. 

Finally, it has been assumed that the data set arises from 
a simple random sample design. However, most surveys 
employ complex sample designs involving stratification, 
clustering, and weighting. Further work is needed to modify 
the sequential regression method to incorporate complex 
design features not reflected in the X variables in expression 
(1). However, even if the imputation process ignores the 
complex design features, the analysis of completed data 
should be design based. Though this does not provide valid 
design-based inferences, it maintains the robustness under- 
lying the design-based analysis to a certain degree. The 
integrated JRR imputation approach discussed above may 
have more appealing design-based properties in a complex 
design setting. 

APPENDIX: REGRESSION MODELS AND 
IMPUTATIONS 


Dropping the subscript indexing of the variables for 
brevity, the necessary steps for imputing each type of 
variable are as follows: 


Continuous variable: For Y (possibly transformed from 
the original scale for normality), a continuous variable, 
build a normal linear regression model, $Y = U\beta + e$, where 
U is the most recently updated predictor matrix, e has a 
multivariate normal distribution with mean zero and 
variance $\sigma^2 I$, and I is an identity matrix. Suppose that 
$\theta = (\beta, \log \sigma)$ has a uniform prior distribution over the 
appropriate dimensional real space. Fit this model based on 
the individuals for whom Y is observed. 

Let $\hat{\beta} = (U^T U)^{-1} U^T Y$ be the estimated regression 
coefficient, $SSE = (Y - U\hat{\beta})^T (Y - U\hat{\beta})$ be the residual sum 
of squares, df = rows(Y) - cols(U) be the residual 
degrees of freedom, and T be the Cholesky decomposition 
such that $T T^T = (U^T U)^{-1}$. The relevant posterior distributions 
can be derived easily (see, for example, Gelman, 
Carlin, Stern and Rubin 1995, Chap. 7), and the following 
steps then provide draws from the posterior predictive 
distribution of missing Y values: 


1. Generate a chi-square random deviate u with df degrees 
of freedom and define $\sigma_*^2 = SSE/u$. 


2. Generate a vector $z = (z_1, z_2, \ldots, z_p)$ of dimension 
p = rows($\hat{\beta}$) of random normal deviates and define 
$\beta_* = \hat{\beta} + \sigma_* T z$. 

3. Let $U_{miss}$ denote the U-matrix for those with missing Y 
values. The imputed values are $Y_* = U_{miss} \beta_* + \sigma_* v$, 
where v is an independent vector of dimension 
rows($U_{miss}$) of random normal deviates. 
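
The three steps above can be sketched in a few lines of code. The following is a minimal illustration only, written in plain Python so that it is self-contained; the function names (`cholesky`, `inverse_spd`, `impute_continuous`) and the list-of-rows data layout are assumptions made for this sketch and are not part of the SAS application mentioned earlier.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def inverse_spd(a):
    """Inverse of a symmetric positive definite matrix via its Cholesky factor."""
    n = len(a)
    low = cholesky(a)
    inv_low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        inv_low[i][i] = 1.0 / low[i][i]
        for j in range(i):
            s = sum(low[i][k] * inv_low[k][j] for k in range(j, i))
            inv_low[i][j] = -s / low[i][i]
    # a^{-1} = (L^{-1})^T L^{-1}
    return [[sum(inv_low[k][i] * inv_low[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def impute_continuous(u_obs, y_obs, u_miss, rng):
    """Draw imputed Y values for the rows of u_miss (hypothetical helper)."""
    p, n = len(u_obs[0]), len(y_obs)
    utu = [[sum(r[a] * r[b] for r in u_obs) for b in range(p)] for a in range(p)]
    uty = [sum(r[a] * y for r, y in zip(u_obs, y_obs)) for a in range(p)]
    utu_inv = inverse_spd(utu)
    beta_hat = [sum(utu_inv[a][b] * uty[b] for b in range(p)) for a in range(p)]
    sse = sum((y - sum(b * x for b, x in zip(beta_hat, r))) ** 2
              for r, y in zip(u_obs, y_obs))
    df = n - p
    t = cholesky(utu_inv)  # T T^T = (U^T U)^{-1}
    # Step 1: chi-square(df) deviate, drawn as Gamma(shape=df/2, scale=2).
    u = rng.gammavariate(df / 2.0, 2.0)
    sigma_star = math.sqrt(sse / u)
    # Step 2: beta_* = beta_hat + sigma_* T z.
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    beta_star = [beta_hat[a] + sigma_star * sum(t[a][k] * z[k] for k in range(p))
                 for a in range(p)]
    # Step 3: Y_* = U_miss beta_* + sigma_* v.
    return [sum(b * x for b, x in zip(beta_star, r)) + sigma_star * rng.gauss(0.0, 1.0)
            for r in u_miss]
```

A call such as `impute_continuous(u_obs, y_obs, u_miss, random.Random(0))` returns one imputed value per row of `u_miss`; repeating the call with fresh random draws yields multiple imputations.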

Binary Variable: When Y is a binary variable, fit a logistic 
regression model relating Y to U (most recently updated), 
$\mathrm{logit}[\Pr(Y = 1 \mid U)] = U\beta$, using individuals with observed 
Y. The imputed values for Y are created through the 
following steps: 


1. Let $\hat{\beta}$ denote the maximum likelihood estimate of $\beta$ 
and V its asymptotic covariance matrix (the inverse 
of the observed Fisher information matrix). Let 
T be the Cholesky decomposition of V (that is, 
$T T^T = V$). Generate a vector z of random normal 
deviates of dimension rows($\hat{\beta}$). Define $\beta_* = \hat{\beta} + Tz$. 


2. Let $U_{miss}$ denote the portion of U for which Y is 
missing. Define $P_* = [1 + \exp(-U_{miss} \beta_*)]^{-1}$. Generate 
a vector u of dimension rows($U_{miss}$) of uniform 
random numbers between 0 and 1. Impute 1 if a 
particular component of u is less than or equal to the 
corresponding component of $P_*$ and impute 0 
otherwise. 


This approach results only in approximate draws from 
the posterior predictive distribution of the missing values as 
the draws of the parameter $\beta$ are from the asymptotic approximation 
of its actual posterior distribution. It is possible to 
draw from the actual distribution by modifying Step 1 
using, for example, Sampling-Importance-Resampling 
(Rubin 1987b). 


Mixed Variable: For Y, a mixed variable (that is, Y either 
takes the value zero or a continuous value), model the zero 
values by a 0-1 indicator to distinguish between 0 and non- 
zero values, and then model a normally distributed variable 
for the continuous portion of the distribution conditional on 
the indicator variable being equal to 1. That is, use a two 
stage approach: impute a one or zero using the logistic 
approach described above; and then restricting the sample 
to those with non-zero values, use the continuous variable 
approach described above to impute a continuous value to 
replace the just imputed value of 1. 


Count Variable: For Y, a count variable, fit a Poisson 
regression model $Y \sim \mathrm{Poisson}(\lambda)$ where $\log \lambda = U\beta$. The 
imputations for missing values in Y are created using the 
following steps: 


1. Let $\hat{\beta}$ denote the maximum likelihood estimate of $\beta$, V 
its covariance matrix and T the Cholesky decomposition 
of V. Generate a vector z of random normal 
deviates of dimension rows($\hat{\beta}$) and define $\beta_* = \hat{\beta} + Tz$. 


2. Let $U_{miss}$ denote the portion of U for which Y is 
missing. Define $\lambda_* = \exp(U_{miss} \beta_*)$. Generate independent 
Poisson random variables with means as the 
elements of $\lambda_*$. 


Polytomous Variable: For Y that can take k values, 
j = 1, 2, ..., k, let $\pi_j = \Pr(Y = j \mid U)$. Fit a polytomous 
regression model relating Y to U where $\log(\pi_j / \pi_k) = U\beta_j$ 
for j = 1, 2, ..., k - 1. Under the restriction $\sum_{j=1}^{k} \pi_j = 1$ it follows 
that $\pi_k = (1 + \sum_{j=1}^{k-1} \exp(U\beta_j))^{-1}$. 

Let $\hat{\beta}$ denote the maximum likelihood estimate of the 
regression coefficients $(\beta_1, \beta_2, \ldots, \beta_{k-1})$, V be the 
asymptotic covariance matrix and T its Cholesky 
decomposition. 

The following steps create imputations: 


1. Define $\beta_* = \hat{\beta} + Tz$ where z is a vector of random 
normal deviates of dimension rows($\hat{\beta}$). 


2. Let $U_{miss}$ denote the rows of U with missing Y and let 
$P_i^* = \exp(U_{miss} \beta_{i*}) / \{1 + \sum_{j=1}^{k-1} \exp(U_{miss} \beta_{j*})\}$, where $\beta_{i*}$ 
denotes the appropriate elements of $\beta_*$, 
i = 1, 2, ..., k - 1, and $P_k^* = 1 - \sum_{i=1}^{k-1} P_i^*$. 


3. Let $R_0 = 0$, $R_j = \sum_{i=1}^{j} P_i^*$ and $R_k = 1$ be the cumulative 
sums of the probabilities. To impute values, generate a 
random uniform number u and take j as the imputed 
category if $R_{j-1} \le u \le R_j$. 
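
Steps 2 and 3 for the polytomous case can be illustrated as follows. This is a minimal sketch under stated assumptions: the perturbed coefficients $\beta_*$ are taken as already drawn in Step 1 and passed in as a list of k - 1 coefficient vectors, and the function names are hypothetical.

```python
import math
import random


def category_probabilities(u_row, beta_star):
    """Return [P_1, ..., P_k] for one row of U_miss.

    beta_star holds k-1 coefficient vectors; category k is the reference.
    """
    etas = [sum(b * x for b, x in zip(beta_j, u_row)) for beta_j in beta_star]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1.0 - sum(probs))  # P_k = 1 - sum of the other k-1 probabilities
    return probs


def draw_category(probs, rng):
    """Impute category j when the uniform draw falls in (R_{j-1}, R_j]."""
    u = rng.random()
    cumulative = 0.0
    for j, p in enumerate(probs, start=1):
        cumulative += p
        if u <= cumulative:
            return j
    return len(probs)  # guards against floating-point round-off in the last sum
```

For each row with a missing value, one would call `category_probabilities` and then `draw_category` with a shared `random.Random` instance.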


Again, the imputations of mixed, count and categorical 
variables are drawn from approximate posterior predictive 
distributions because the corresponding parameters are drawn 
from their asymptotic normal approximate posterior 
distributions. 

A Better Understanding of Weight Transformation 
Through a Measure of Change 


JOHANE DUFOUR, FRANCOIS GAGNON, YVES MORIN, MARTIN RENAUD and CARL-ERIK SARNDAL¹ 


ABSTRACT 


The literature on longitudinal surveys of households offers several approaches for creating a set of final weights for use in 
data analysis. Most of these approaches depend on various procedures for modifying weights. Initial weights are often 
transformed into a set of intermediate weights in order to compensate for nonresponse, and then into a set of final weights, 
through poststratification, in order to adjust the sample. The literature includes a great deal of information about this 
approach, but none of the studies has looked closely at measuring the relative importance of these two 
steps, or at measuring the effectiveness of the numerous existing alternatives for creating intermediate weights. The objective 
of this paper is to study and measure the change (from the initial to the final weight) which results from the procedure used 
to modify weights. A breakdown of the final weights is proposed in order to evaluate the relative impact of the nonresponse 
adjustment, the correction for poststratification and the interaction between these two adjustments. This measure of change 
is used as a tool for comparing the effectiveness of the various methods for adjusting for nonresponse, in particular the 
methods relying on the formation of Response Homogeneity Groups. The measure of change is examined through a 
simulation study, which uses data from a Statistics Canada longitudinal survey, the Survey of Labour and Income Dynamics. 
The measure of change is also applied to data obtained from a second longitudinal survey, the National Longitudinal Survey 
of Children and Youth. 


KEY WORDS: Nonresponse; Weighting; Calibration; Longitudinal survey; Measure of change. 


1. INTRODUCTION 


The literature contains many two-step approaches to 
transforming weights for household surveys. The first step 
involves an adjustment of the initial weights in order to 
compensate for nonresponse; the resulting weights are 
called intermediate weights. The second step produces the 
final weights through the process of poststratification, or 
more commonly through calibration (see Deville and 
Sarndal 1992), in order to ensure that the final weights 
respect certain known population control totals. All of these 
weight modifications are designed to produce the “best 
possible set of final weights”. 

At Statistics Canada, longitudinal surveys of households 
also use this two-step approach in weighting, and the 
research work undertaken by the Agency leans in this 
direction. The U.S. Bureau of the Census “Survey of 
Income and Program Participation (SIPP)” (see Rizzo, 
Kalton and Brick 1996) also uses this type of approach. 

Several methods are recommended in the literature for 
adjusting weights to compensate for nonresponse. Rizzo 
et al. (1996) compared the estimates obtained through 
several of these methods to estimates from independent 
sources. However, not many authors have done simulations 
or proposed tools for comparing the relative effectiveness 
of the methods in terms of their ability to reduce the 
nonresponse bias. 

The main objective of this document is to study and 
measure the change (between initial and final weights) 
resulting from the adoption of a two-step procedure for 
modifying weights. Thus, a measure of change involving 
four components is proposed in order to quantify the relative 
impact of the nonresponse adjustment, the correction 
for poststratification and the interaction between these two 
adjustments. The second objective is to use the measure of 
change to compare the effectiveness of the different 
nonresponse adjustment methods through a simulation 
study based on data from the Longitudinal Survey of 
Labour and Income Dynamics (SLID) and from the 
National Longitudinal Survey of Children and Youth 
(NLSCY). The longitudinal surveys are unique in that a 
great deal of information about respondents and non- 
respondents to the latest wave is available from respondents 
to the previous waves. Thus, more complex methods can be 
used to adjust for nonresponse. 

A general framework for the weighting of longitudinal 
surveys of households is presented in section 2. Then, the 
measure of change which will be used to quantify the stages 
of transformation between the initial and the final weights is 
presented in section 3. Section 4 addresses the nonresponse 
adjustment strategies contained in the literature. This is 
followed by sections 5 and 6, which contain the results of the 
studies based on the SLID and NLSCY. The last section 
presents the conclusions of this study. 


2. GENERAL FRAMEWORK FOR 
LONGITUDINAL WEIGHTING 


In a longitudinal survey of households, individuals in the 
initial sample are followed over time, and are referred to as 
longitudinal individuals. This set of individuals is the one 
which will be used in the studies presented in this 
document. They are referred to as the “reference unit”. This 
section provides an overview of the steps followed in order 
to modify the initial weight for longitudinal individuals into 
a final weight. 


2.1 Initial Weights 


$U = \{1, \ldots, k, \ldots, N\}$ is a finite population. We are 
interested in variable y (the variable of interest), whose 
value for the k-th unit is recorded as $y_k$. The objective is to 
estimate the total $Y = \sum_U y_k$. Let $w_{0k}$ be the initial weight 
for all units $k \in s$, where s is the longitudinal sample. In the 
absence of nonresponse, the set of initial weights 
$\{w_{0k}: k \in s\}$ yields the estimator $\hat{Y} = \sum_s w_{0k} y_k$ for Y. In this 
case we assume that the $w_{0k}$ are normalized in order to 
ensure that $\sum_s w_{0k} = N$. Although $\hat{Y}$ is unbiased for Y, $\hat{Y}$ 
has the drawback of not incorporating any ancillary 
information in the form of known control totals for 
poststrata. 


2.2 Nonresponse Adjustment and Intermediate 
Weights 


Most surveys have to deal with nonresponse. Two 
approaches are often used to compensate for this: impu- 
tation and the correction of the initial weights of respon- 
dents through an adjustment factor. The latter is the one 
more commonly used in household surveys to compensate 
for total nonresponse, while imputation is often preferred 
when dealing with partial nonresponse. Total nonresponse 
reduces the size of the sample since the value $y_k$ is only 
available for $k \in r$, where $r \subset s$ is the set of the m 
responding units. For this reduced set of data, the initial 
weights $w_{0k}$ are, on average, too small and we have 
$\sum_r w_{0k} < N$. The estimator $\hat{Y}' = \sum_r w_{0k} y_k$ is not admissible 
since it systematically underestimates Y. 

Weight adjustment is often chosen in order to compen- 
sate for total nonresponse in household surveys. A common 
method of adjusting weights involves constructing 
Response Homogeneity Groups (RHGs). These are 
designed so that each one is comprised of reference units 
having a similar probability of response. Then, within each 
RHG, an adjustment factor equal to the inverse of the 
RHG’s response rate (weighted or not) is calculated. For 
each respondent unit k, the adjustment for nonresponse 
involves multiplying $w_{0k}$ by the RHG’s adjustment factor. 
This operation results in a set of intermediate weights 
$\{w_{1k}: k \in r\}$, where $\sum_r w_{1k} = N$. With these weights, we can 
construct the estimator $\hat{Y}_1 = \sum_r w_{1k} y_k$, which eliminates the 
underestimation which is characteristic of $\hat{Y}' = \sum_r w_{0k} y_k$. 
As in the case of the set of initial weights, the main 
drawback with this set is that it fails to incorporate the 
ancillary information available for poststrata. 
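
The RHG adjustment described above can be sketched as follows. This is a minimal illustration, assuming the weighted version of the response rate and a simple parallel-list data layout; the function name `rhg_adjust` is hypothetical.

```python
from collections import defaultdict


def rhg_adjust(rhg, w0, responded):
    """Return intermediate weights w1, one per sampled unit.

    rhg[i] is the RHG label of unit i, w0[i] its initial weight and
    responded[i] a boolean. Nonrespondents receive None.
    """
    total = defaultdict(float)  # sum of w0 over all sampled units in the RHG
    resp = defaultdict(float)   # sum of w0 over the respondents in the RHG
    for g, w, r in zip(rhg, w0, responded):
        total[g] += w
        if r:
            resp[g] += w
    for g in total:
        if resp[g] == 0.0:
            raise ValueError(f"RHG {g!r} has no respondents; combine it with another RHG")
    w1 = []
    for g, w, r in zip(rhg, w0, responded):
        if r:
            # Adjustment factor: inverse of the weighted response rate of the RHG.
            w1.append(w * total[g] / resp[g])
        else:
            w1.append(None)
    return w1
```

Because each RHG's respondents are scaled up to the weighted total of the whole RHG, the intermediate weights of the respondents sum to the same total as the initial weights of the full sample.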


2.3 Poststratification and Final Weights 


A widely used practice in household surveys involves 
modifying the intermediate weights through poststratification, 
or, more commonly, through calibration, so that 
the sum of the final weights on the set of respondents will 
correspond to the known population counts. Thus, poststratification 
produces a set of final weights $\{w_{2k}: k \in r\}$, which 
incorporates the ancillary information and which is also 
consistent with the control totals for the poststrata. In this 
case, the final weights in each poststratum p satisfy 
$\sum_{r_p} w_{2k} = N_p$, where $N_p$ is the known population count and $r_p$ is the 
set of respondent units in the p-th poststratum. It follows 
that $\sum_r w_{2k} = N$. Demographic and geographic variables 
are frequently used to define poststrata. The choice of 
poststrata, which must be sufficiently large, is limited by the 
availability of control totals. Several methods may be used 
to calibrate the intermediate weights to the selected control 
totals. 
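
As an illustration of the simplest case, the following minimal sketch applies a ratio adjustment within each poststratum so that the final weights reproduce the control totals. The function name and data layout are assumptions for this example, and more general calibration methods are not covered.

```python
from collections import defaultdict


def poststratify(poststratum, w1, control_totals):
    """Return final weights w2 for respondents.

    poststratum[i] is the poststratum label of respondent i, w1[i] its
    intermediate weight, and control_totals maps each label p to N_p.
    """
    sums = defaultdict(float)
    for p, w in zip(poststratum, w1):
        sums[p] += w
    # Within poststratum p, scale every weight by N_p / (sum of w1 over r_p).
    return [w * control_totals[p] / sums[p] for p, w in zip(poststratum, w1)]
```

After this step the final weights in each poststratum sum to $N_p$, and their overall sum equals the total of the control counts.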


3. MEASURE OF CHANGE FROM INITIAL TO 
FINAL WEIGHTS 


In this section, a measure of the change between initial 
and final weights is presented so as to better understand the 
effect of the weight modification procedure. The breakdown 
of this measure into four components makes it 
possible to quantify the effect of each of the weighting steps 
described in section 2. These components will be used in 
sections 5 and 6 in the comparison of various methods for 
adjusting weights for nonresponse. 

If the initial weights are normalized so that $\sum_s w_{0k} = N$, 
and if $r \subset s$, then the three sets of weights described in 
section 2 satisfy the following relations: 


$$\sum_r w_{0k} < N, \quad \sum_r w_{1k} = N, \quad \sum_r w_{2k} = N.$$ 

The ratio $\bar{w}_{01}$ measures the average change in the intermediate 
weight set in relation to the initial weight set. As 
total nonresponse becomes more pronounced, $\bar{w}_{01}$ shifts 
farther away from the value of 1, which is only obtained in 
the absence of nonresponse. The ratio $\bar{w}_{02}$ represents the 
average change in the set of final weights in relation to the 
set of initial weights. 

The $\bar{w}_{01}$ and $\bar{w}_{02}$ ratios measure the average change in 
weight. To measure an individual change in weight, we 
define, for every $k \in r$, $r_{01k} = w_{1k} / (w_{0k} \bar{w}_{01})$ and 
$r_{02k} = w_{2k} / (w_{0k} \bar{w}_{02})$. These quantities vary around 1. 
More specifically, their weighted averages equal 1: 

The $r_{01k}$ and $r_{02k}$ quantities will be useful for measuring 
individual weight changes. 

The total weight change, from the set of initial to final 
weights, going through the set of intermediate weights, can 
be calculated by a measure of change, also called distance. 
Here, D is the following measure of change: 


$$D = \frac{\sum_r w_{0k} \left( w_{2k}/w_{0k} - 1 \right)^2}{\sum_r w_{0k}}.$$ 


In fact, D is a weighted average of the following indi- 
vidual weight change factors: 


$$\left( \frac{w_{2k}}{w_{0k}} - 1 \right)^2 = \left( \frac{w_{2k} - w_{0k}}{w_{0k}} \right)^2.$$ 

The measure of change D breaks down into four 
components, as set out in the following equation: 


$$D = R_{01} + R_{12} + R_{int} + G,$$ 

where 

$$G = (\bar{w}_{02} - 1)^2.$$ 


It should be noted that the measure of change D is 
always non-negative, and equals zero when the two 
following conditions are met: 


(i) absence of nonresponse ($r = s$ and $w_{1k} = w_{0k}$ for all k), 


(ii) absence of poststratification effect on the intermediate 
weights ($w_{2k} = w_{1k}$ for all k). 


A high nonresponse rate would tend to increase the value 
of the measure of change D since in such a case, $w_{2k}$ is 
generally much larger than $w_{0k}$. 

$R_{01}$ measures the individual weight changes which result 
from going from the initial to the intermediate set. Later, we 
will see that the component $R_{01}$ is associated to some extent 
with the quality of the nonresponse model and that a large $R_{01}$ 
value is preferable. $R_{12}$ measures the individual weight 
changes which result from going from the intermediate to 
the final set. $R_{int}$ measures the interaction between the two 
types of change and G measures the change in average 
weight between the initial and final sets. 

In addition to its interpretation as a distance, the measure 
of change D can also be interpreted as a mean square error 
of the changes $w_{2k}/w_{0k}$ in relation to 1, with respect to the 
distribution defined by all the $w_{0k}$. From this perspective, 
the component G corresponds to the bias squared (or the 
square of the difference between the $w_{0k}$-weighted average of 
$w_{2k}/w_{0k}$ and 1), while the sum of the other three 
components corresponds to the variance. In the simplest 
case, where a nonresponse adjustment is calculated using a 
single RHG, and where no poststratification is applied, we 
have $w_{0k} = N/n$ for all $k \in s$ (in the case of a simple 
random sample of size n) and $w_{1k} = w_{2k} = N/m$ for all $k \in r$ (where 
the nonresponse adjustment factor is n/m, i.e., the inverse of 
the response rate). We then have $D = G = \{(n/m) - 1\}^2$ and 
$R_{01} = R_{12} = R_{int} = 0$. 
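
The mean-square-error reading of D can be checked numerically. The sketch below computes D and G as described in the text and, as an assumption of this example only, splits the remaining variance part into $R_{01}$, $R_{12}$ and $R_{int}$ using the identity $r_{02k} - 1 = (r_{01k} - 1) + (r_{02k} - r_{01k})$. This is one algebraically valid four-way split that matches the verbal descriptions of the components; it is not presented as the exact component formulas of the paper.

```python
def measure_of_change(w0, w1, w2):
    """Return a dict with D and an illustrative four-way split.

    w0, w1, w2 are the initial, intermediate and final weights of the
    respondents (same order, all positive).
    """
    s0 = sum(w0)
    wbar01 = sum(w1) / s0  # average change, initial -> intermediate
    wbar02 = sum(w2) / s0  # average change, initial -> final
    r01 = [b / (a * wbar01) for a, b in zip(w0, w1)]
    r02 = [c / (a * wbar02) for a, c in zip(w0, w2)]

    def wmean(values):
        # Average with respect to the distribution defined by the w0 weights.
        return sum(a * v for a, v in zip(w0, values)) / s0

    d = wmean([(c / a - 1.0) ** 2 for a, c in zip(w0, w2)])
    g = (wbar02 - 1.0) ** 2
    # Assumed split of the variance part (see the lead-in above).
    r_01 = wbar02 ** 2 * wmean([(x - 1.0) ** 2 for x in r01])
    r_12 = wbar02 ** 2 * wmean([(y - x) ** 2 for x, y in zip(r01, r02)])
    r_int = 2.0 * wbar02 ** 2 * wmean([(x - 1.0) * (y - x) for x, y in zip(r01, r02)])
    return {"D": d, "R01": r_01, "R12": r_12, "Rint": r_int, "G": g}
```

In the single-RHG case without poststratification (for example n = 10 sampled units, m = 8 respondents and N = 100), the function returns D = G = (10/8 - 1)^2 = 0.0625 and zero for the three other components, in line with the simplest case described above.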

Some significant conclusions may be drawn from 
looking at the relative importance of $R_{01}$, $R_{12}$ and $R_{int}$. If 
$R_{01}$ is high at the same time that $R_{12}$ is not very high, the 
survey is one in which the nonresponse adjustment creates 
significant individual changes in weights, while poststratification 
only results in a slight change in individual weights. 
However, when $R_{12}$ is high, poststratification brings about 
very large individual changes. The results presented in 
sections 5 and 6 will show that $R_{01}$ can be used to compare 
the effectiveness of various nonresponse adjustment 
methods. As well, the sign of $R_{int}$ indicates whether the two 
types of individual change are moving in the same direction 
($R_{int} > 0$) or in opposite directions ($R_{int} < 0$). In reality, we 
expect $R_{int}$ to be very small, if not negligible. 


4. NONRESPONSE ADJUSTMENT 
STRATEGIES 


The literature contains several methods for adjusting 
weights (including the method described in section 2.2) to 
compensate for nonresponse. Another method, which is 
frequently used in longitudinal surveys, involves adjusting 
weights in accordance with the inverse of the predicted 
probability of response obtained through a logistic 
regression. We also find methods of adjustment based on 
calibration, which use marginal distributions of the initial 
sample or of the population. Singh, Wu and Boyer (1995) 
used this approach in order to derive a method of adjust- 
ment capable of producing coherent estimates in longitu- 
dinal surveys from one wave to the next. Deville (1998) 
recommended a method of correction for nonresponse by 
calibration or balanced sampling. For a review of 
nonresponse adjustment methods, refer to Kalton and 
Kasprzyk (1986), Platek, Singh and Tremblay (1978), 
Chapman, Bailey and Kasprzyk (1986) and to Little (1986). 
In this document, only methods relying on the creation of 
RHGs are considered. 

4.1 Formation of RHGs 


In most surveys, aside from a few stratification variables 
from the sample frame, very little information is available 
about non-respondents. Therefore, the choice of RHGs is 
very limited and the strata are often used as RHGs. In these 
cases, the assumption is that the probability of response is 
the same for all units in a given stratum. However, in longi- 
tudinal surveys, a great deal of information about respon- 
dents and non-respondents in the current wave is available 
from the responses provided in the previous waves. This 
information can then be used to create RHGs within which 
the assumption of a uniform response mechanism is plausi- 
ble. This leads to a better nonresponse adjustment and, 
therefore, a reduction in the risk of introducing a 
nonresponse bias into the estimates. 


4.1.1 Method for the Selection of Variables for the 
Formation of RHGs 


By definition, an RHG is formed from a set of variables 
capable of predicting the propensity to respond. If the set of 
variables which is defined at the outset is too large, uni- 
variate tests may be used to isolate the most important 
variables to distinguish the characteristics of respondents 
from those of nonrespondents. With this set of important 
variables, a selection method may then be applied for 
retaining the best variables for explaining the propensity to 
respond. Two of the current variable selection methods are: 
the Logistic Regression Model (LR) and the Segmentation 
Model (SM). 


4.1.1.1 Logistic Regression 


Under the LR method, the combined use of the “fact of 
having responded to the survey or not” as a dependent 
variable, standardized weights and the “stepwise” procedure 
results in a list of the most significant dichotomous variables 
for explaining the propensity to respond. As a general 
rule, RHGs are created according to the $2^q$ possible combinations 
of the q explanatory variables used. The 
LR is often referred to as the symmetrical approach. 
However, if certain additional constraints are applied when 
the RHGs are created, this could reduce their number. For 
instance, we could require a minimum number of reference 
units (n) and a response rate (RR) (weighted or not 
weighted) greater than a certain level in each of the RHGs. 
Kalton and Kasprzyk (1986) encourage the use of such 
constraints in order to avoid increasing the variance associated 
with extreme weights. However, these constraints 
may reduce the effectiveness of the nonresponse adjustment 
and result in an increase in the bias. When an RHG does not 
meet one of these constraints, it has to be combined with 
another RHG. The combination of RHGs continues until all 
of the RHGs meet the additional constraints imposed. This 
leads to $2^q - J$ valid combinations, where J represents the 
reduction resulting from the combination of RHGs. 


For instance, in Figure 1, $2^3 = 8$ RHGs are created on 
the basis of q = 3 explanatory variables. The shaded boxes in 
Figure 1 represent the RHGs. An adjustment factor is 
calculated within each RHG and the weight $w_{0k}$ of each 
reference unit is then adjusted accordingly. 
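
The cross-classification into at most $2^q$ cells, together with a check of the additional constraints, can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout of the units, the use of an unweighted response rate and the default thresholds are assumptions of this example, and the rule for choosing which RHGs to combine is deliberately left out because the text does not specify it.

```python
from collections import defaultdict


def form_cells(units, variables):
    """Group unit indices by their values on the q dichotomous variables."""
    cells = defaultdict(list)
    for idx, unit in enumerate(units):
        key = tuple(unit[v] for v in variables)
        cells[key].append(idx)
    return dict(cells)


def cells_to_combine(cells, responded, min_n=30, min_rr=0.5):
    """Return the cells that fail the constraints n > min_n and RR > min_rr."""
    flagged = []
    for key, members in cells.items():
        n = len(members)
        rr = sum(1 for i in members if responded[i]) / n  # unweighted response rate
        if n <= min_n or rr <= min_rr:
            flagged.append(key)
    return flagged
```

Cells returned by `cells_to_combine` would then be merged with other cells until every remaining RHG satisfies the constraints.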


4.1.1.2 Segmentation Model 


The SM method, which is referred to as non- 
symmetrical, is based on the CHAID (Chi-square 
Automatic Interaction Detection) algorithm developed by 
Kass (1980). It divides the sample into sub-groups according 
to the response rate of the explanatory variables by 
using a Chi-square test. The segmentation process 
continues until no further significant explanatory variable is found. 
The final sub-groups created through the SM become the 
RHGs, for which the nonresponse adjustments are calcu- 
lated. As in the case of the LR, additional constraints may 
be imposed. 

In Figure 1 we see that the SM method divided the 
sample into several RHGs based on the different explana- 
tory variables. The RHGs are once again represented by the 
shaded boxes. The segmentation continues until it is no 
longer possible to find explanatory variables. 


4.1.2 Nonresponse Adjustment Factor 


Whether the RHGs are formed by relying on the LR or 
the SM, a uniform response mechanism is assumed within 
each RHG. Thus, the nonresponse adjustment factor is 
given by the inverse of the response rate (weighted by $w_{0k}$ 
or not weighted) for the RHG. 


5. EMPIRICAL STUDY BASED ON THE 
SURVEY OF LABOUR AND INCOME 
DYNAMICS (SLID) 


Data from the SLID were used for an empirical study 
designed to compare the effectiveness of the LR and SM. 
The SLID is a longitudinal survey of households that started 
in 1993; one of its objectives is to provide information on 
the economic well-being of Canadian society (see Lavigne 
and Michaud 1998). 

These two methods were tested through a simulation by 
analyzing some variables of interest and various domains. 
The components of the measure of change, the absolute and 
relative biases and the variances were studied. 


5.1 Description of the Empirical Study 


The first step in the empirical study was to estimate the 
probability of response to the first wave of the survey for 
each of the units in the longitudinal sample. Variables 
which could potentially explain the propensity to respond 
(based on a preliminary interview) were used to form a very 
large number of RHGs. All of the individuals in the sample 
were assigned to an RHG on the basis of the values of the 
explanatory variables. A probability of response was then 
estimated for each RHG on the basis of the weighted 
response rate. Then, only the respondents and their 
probability of response were retained in the reference 
sample for the simulation. Nonresponse was then generated 
for the reference sample through Poisson sampling. This 
procedure, illustrated in Figure 2, was independently 
repeated 100 times, thus creating 100 sets of respondents 
and non-respondents. The average response rate for each 
repetition was around 90%, which was the rate observed in 
the first wave of the SLID. 

[Figure 1. Depiction of the Formation of RHGs by Method (panels: Logistic Regression; Segmentation Model)] 
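
The generation of nonresponse through Poisson sampling can be sketched as follows. This is a minimal illustration, assuming that each unit of the reference sample carries its own estimated probability of response and that units respond independently of one another; the function name and the seed argument are hypothetical.

```python
import random


def simulate_response_sets(probs, n_reps, seed=0):
    """Return n_reps lists of booleans, one response indicator per unit.

    probs[i] is the probability of response assigned to unit i. Each unit is
    kept as a respondent independently with that probability.
    """
    rng = random.Random(seed)
    repetitions = []
    for _ in range(n_reps):
        repetitions.append([rng.random() < p for p in probs])
    return repetitions
```

With response probabilities averaging about 0.9, each repetition yields a response rate close to 90%, and the number of respondents varies from one repetition to the next, as is typical of Poisson sampling.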

For each of the 100 repetitions, a nonresponse adjust- 
ment was done using the LR method to create the RHGs. 
Similarly, a nonresponse adjustment was done using the SM 
to create RHGs for each of the first 20 repetitions. With the 
SM approach, the number of repetitions was limited to 20, 
given the stability of the results and since several manual 
interventions and the use of a specific software package (in 
our case: Knowledge Seeker - ANGOSS Software 1995) 
were required. 

Several variants of the variable selection method were 
studied: 


a) LR_i, where i represents, out of the 100 repetitions, the 
approximate average of the number of RHGs generated 
through the LR method. In this study, i = 4, 16, 40, 60. 
For instance, for LR_40, the q = 6 most important 
explanatory variables for the propensity to respond 
were first identified. The RHGs were then formed 
using the ($2^q - J$) valid combinations of these q = 6 
explanatory variables. The imposition of additional 
constraints (n > 30 and RR > 50%) in each RHG led to 
the re-grouping of some RHGs. On average, out of 100 
repetitions, 24 RHGs had to be regrouped (J = 24) and 
a total of $2^q - J = 2^6 - 24 = 40$ RHGs were formed, 
hence the LR_40 designation. In the simulation study, 
LR_i, where i = 4, 16, 40, 60 RHGs, corresponds, 
respectively, to q = 2, 4, 6, 8 explanatory variables. 


b) SM_i, where i indicates the approximate average in the 
first 20 repetitions of the number of RHGs generated 
through the SM method. In this study, i = 16, 25, 40. For 
example, for SM_16, one SM was used with a significance 
level p of 0.0001. After the imposition of the 
same additional constraints as for the LR, an average of 
16 RHGs were created. SM_i, where i = 16, 25, 40 
RHGs corresponds, respectively, to the significance 
levels of 0.0001; 0.0005; 0.0025. The higher the level 
used, the easier it is to identify the significant differ- 
ences, which makes it possible to achieve a more 
detailed segmentation and, hence, a greater number of 
RHGs. 


c) A method with a single RHG (1_RHG) was also used 
for comparison purposes. This method involves 
defining the entire sample as a single RHG for each of 
the 100 repetitions. It should be noted that this method 
is only effective if the response mechanism is uniform 
within the entire sample, which is rarely the case. 


First, the initial weights were normalized so that 
$\sum_s w_{0k} = N$, in order to eliminate the effect of undercoverage 
and to better isolate the effect of nonresponse. 
Thus, G will only measure the average change caused by 
the nonresponse adjustment. 

[Figure 2. Illustration of the Simulation Process] 


Once the initial weights were normalized, each set of 
final weights was then the result of a two-step process: a 
nonresponse adjustment (based on one of the eight methods 
mentioned: 1_RHG, LR_i, where i = 4, 16, 40, 60 and 
SM_i, where i = 16, 25, 40) and the same poststratification 
(14 age-sex groups by province). 


5.2 Analysis of the Results of the Empirical Study 


For each of the methods discussed in the previous 
section, the components of the measure of change D were 
studied. Also, the average, absolute and relative 
nonresponse bias and the average variance of the estimates 
were analyzed. 


5.2.1 Measure of Change (D) 


Table 1 presents the average value of D and its 
components over the M repetitions (where M=100 for 
the LR and M=20 for the SM) as well as the percentage 
contribution of each element to the average value of D. We 
observe, in the first place, that for the 1_RHG method, $R_{01}$ 
is nil since one single nonresponse adjustment was made to 
the set of respondents. Thus, $w_{1k} = a w_{0k}$, where a is a 
constant, so $r_{01k} = 1$ for every $k \in r$ and $R_{01} = 0$. We also 
observe that D increases as the number of RHGs increases, 
irrespective of whether the LR or SM method is used. Thus, 
the more RHGs there are to compensate for nonresponse, 
the greater the total change to which the weights are 
subjected. In addition, the values of D are higher for the SM 
than for the LR. 

[Figure 2 legend: i: repetition; r: set of respondents; nr: set of nonrespondents] 

For the LR and the SM, the contribution of $R_{01}$ to the
measure of change increases as the number of RHGs
increases, since nonresponse is more readily targeted as the
number of RHGs increases. Consequently, the nonresponse
adjustment becomes more important and, thereby, the
weights vary more and more. In addition, the contribution
of $R_{01}$ to the measure of change is much larger
with the SM than with the LR. This indicates that the SM
seems to be better than the LR at modeling nonresponse and
isolating its specific trends.


Table 1
Average Value of D over the Repetitions, of each Component, and of their Contribution (as a %) to the Measure of Change
for each of the Eight Nonresponse Adjustment Methods


Method  D         R_01     R_01/D  R_12         R_12/D       R_int               R_int/D  G            G/D
                  (×10⁻³)  (%)     (×10⁻³)      (%)          (×10^[illegible])   (%)      (×10⁻²)      (%)
1_RHG   0.012135  0.00     0.00    [illegible]  9.66         0.00                0.00     1.11         90.34
LR_4    0.012952  0.78     6.04    1.10         8.49         0.06                0.01     1.11         85.46
LR_16   0.013809  1.66     11.97   1.00         [illegible]  3.76                0.54     [illegible]  [illegible]
LR_40   0.014426  2.32     16.02   0.96         6.66         4.02                0.55     [illegible]  [illegible]
LR_60   0.014948  2.85     19.00   0.95         6.35         [illegible]         0.49     [illegible]  74.15
SM_16   0.015712  3.42     21.33   0.97         6.19         3.40                0.43     [illegible]  72.05
SM_25   0.016713  4.44     26.02   0.95         5.73         2.95                0.36     [illegible]  67.89
SM_40   0.018202  5.97     32.37   0.95         5.23         1.20                0.14     [illegible]  62.26


As for $R_{12}$, it is almost constant, regardless of which
method and number of RHGs are used. However, despite
the fact that it changes very little, its contribution to the
measure of change diminishes as the number of RHGs
increases. This is because the nonresponse adjustment
produces more variation in the weights, so the
modifications which poststratification creates in the weights
become relatively less important as the number of RHGs
increases.

In the case of $R_{int}$, its value is negligible and its contribution
to the measure of change is very small. This means that
the interaction between the nonresponse adjustment and
poststratification is practically nil.

Finally, G remains constant, irrespective of which
method and how many RHGs are used. As with $R_{12}$, the
contribution of G to the measure of change diminishes as
the number of RHGs increases. A larger number of RHGs
is better at targeting nonresponse, thereby causing more
variations in the set of intermediate weights.

Since, with all of these methods, G is constant, $R_{int}$ is
close to zero and $R_{12}$ is nearly constant, it is clear that the
variations in D are mostly influenced by the variations in
$R_{01}$.

Graph 1 shows the average percentage contribution of $R_{01}$
and $R_{12}$ to the measure of change. For LR and SM, the
contribution of $R_{01}$ increases with the number of RHGs
while that of $R_{12}$ diminishes. Also, the contribution of $R_{01}$
is greater for SM than for LR, while that of $R_{12}$ is less for
SM than for LR. The profile of the contribution of $R_{01}$ is
the same as the profile of D (Table 1). This confirms that
the variations in the measure of change are mainly due to
the variations in $R_{01}$.

Graph 2 shows the comparison between the LR and SM
in terms of the average percentage contribution
of $R_{01}$ to D. For a given number of RHGs, $R_{01}$
contributes a larger percentage of D through the SM method
than through the LR method. This means that individual
changes in the weights between the initial and intermediate
sets are greater for SM than for LR.


Graph 1: Average contribution of $R_{01}$ and $R_{12}$ to the
measure of change (D) for each method

Graph 2: Average contribution of $R_{01}$ to the
measure of change (D) for each method


5.2.2 Relative and Absolute Biases

The Relative Bias (RB) and the Absolute Bias (AB) 
were used to compare the performance of LR relative to SM 
in reducing the nonresponse bias: 


$$RB_i = 100 \times \frac{\hat{Y}_i - Y}{Y} \quad \text{and} \quad AB_i = |\hat{Y}_i - Y|,$$


where $\hat{Y}_i$ is the estimate of the variable of interest obtained
for the i-th repetition, i = 1, 2, ..., M (M=100 for the LR,
M=20 for the SM) and Y is the total for the variable of
interest obtained from the reference sample.

The Average Relative Bias (ARB) and the Average 
Absolute Bias (AAB) are calculated by taking, respectively, 
the average of the RB and the AB for all repetitions: 

$$ARB = \frac{1}{M} \sum_{i=1}^{M} RB_i \quad \text{and} \quad AAB = \frac{1}{M} \sum_{i=1}^{M} AB_i,$$

where M=100 in the case of the LR and M=20 in the case of
the SM.
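
As an illustration, the four bias measures can be computed as in the minimal sketch below. It assumes that the relative bias is a signed percentage and that the absolute bias is the absolute value of the difference between the estimate and the reference total; the function names and the numbers are hypothetical and are not taken from the study.

```python
def relative_bias(estimate, reference_total):
    """Signed relative bias, in percent, of one repetition."""
    return 100.0 * (estimate - reference_total) / reference_total


def absolute_bias(estimate, reference_total):
    """Absolute bias of one repetition (assumed to be an absolute value)."""
    return abs(estimate - reference_total)


def average(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical reference total and estimates from M = 5 repetitions.
reference_total = 2000.0
estimates = [2010.0, 1990.0, 2020.0, 1980.0, 2030.0]

arb = average(relative_bias(e, reference_total) for e in estimates)
aab = average(absolute_bias(e, reference_total) for e in estimates)
```

In this toy case the signed relative biases partly cancel out (ARB = 0.3%), whereas the absolute biases do not (AAB = 18), which is why the two averages can rank methods differently.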

For the 100 repetitions, national estimates were produced 
for the following three variables: “person living, or not, in
a family whose income is less than the Low Income Cutoff
(LICO)”, “Individual Total Income (TI)” and “Individual 
Wages and Salaries (WS)”. The ARB for each estimate was 
calculated for the eight methods under study. Given the 
large sample size, the low nonresponse rate (10%) and the 
fact that a large number of control totals was used for 
poststratification, the ARB is very small (see Table 2) for 
each of the methods used. 

In Table 2 we see that, for each of the three variables, the
ARB is more or less constant for the SM, irrespective of
how many RHGs are used. Also, for the LR, the ARB for
the TI and WS is more or less constant, notwithstanding the
number of RHGs used. On the other hand, for the LICO, the
ARB for method LR_4 is much smaller than the ARB for
the other three LR methods. This could be due to the fact
that the LICO is a variable derived from several other
variables, unlike the TI and the WS, which are observed
variables. The ARB for the three variables for method
1_RHG is much larger in absolute value than the ARB produced
by the SM and the LR, except for the LICO, since in this case the ARB
is more or less equivalent to the ARB of the LR. Thus, it
appears that method 1_RHG does not perform as well as the
SM and the LR. In the best case, it is more or less equivalent
to LR. Unlike for the SM, we observe that the ARB for the LR
does not decrease steadily as the number
of RHGs increases.

Despite the fact that the ARB is minimal for the
variables studied for Canada, it can increase rapidly for
small domains. In this study, other domains were also
reviewed. Although some variations were observed in
several of these cases, it seems that the ARB for the SM is
generally smaller than the ARB for the LR and the method
1_RHG. A more detailed study of a larger number of
variables of interest and domains would be beneficial for
corroborating these conclusions.

As previously indicated, the individual changes in the 
weights caused by the nonresponse adjustments are greater 
for the SM than for the LR (see Graph 2). This would 
suggest that the SM is more effective in reducing the 
nonresponse AB for a fixed number of RHGs. Graph 3 
confirms this observation, showing that the AAB for the 
LICO is smaller through the SM than through the LR 
method. 


5.2.3 Variance Estimates


Variance estimates were produced for the three variables
of interest through the Jackknife method. For LICO (Graph
4), the average variance of estimates is approximately the
same, regardless of the method used. However, there is a
slight decrease when the number of RHGs increases, for
both the LR and the SM. Also, based on the empirical
study, average variance estimates for the SM are slightly
smaller than for the LR. Therefore, the larger dispersion in
the weights (a higher value for D) does not entail an increase
in variance.


6. APPLICATION TO THE NATIONAL 
LONGITUDINAL SURVEY OF 
CHILDREN AND YOUTH (NLSCY) DATA 


In this section, most of the analyses done with the help 
of the LR and SM in the empirical study with data from the 
SLID are reproduced with the information obtained from 
the NLSCY. Just like the SLID, the NLSCY is a longitu- 
dinal survey of households. It started in 1994 and is 
designed to collect information for analyzing policies and 
developing programs addressing critical factors affecting 
the development of children in Canada (see Michaud, 
Morin, Clermont and Laflamme 1998). 


Table 2 
ARB (as a %) for Different Variables Based on the Methods — Canada 
Variable                              Method studied
          1_RHG   LR_4    LR_16   LR_40   LR_60   SM_16    SM_25    SM_40
LICO      0.37    0.15    0.43    0.37    0.31    0.14     0.12     0.08
TI        -0.32   -0.09   -0.06   -0.05   -0.06   -0.006   -0.005   0.002
WS        -0.44   -0.13   -0.15   -0.19   -0.14   -0.10    -0.09    -0.09


Graph 4: Average variance for the LICO for each method


6.1 Description and Analysis of the Results of the
Application


The following methods were used for this study: LR_i, 
where i=4, 14, 41, 70 with, respectively g=2, 4, 6, 8 
variables, and SM_i, where i=19, 36 with significance 
levels of 0.001 and 0.005, respectively. The same two 
constraints imposed for the SLID were re-applied when the 
RHGs were created. The same poststratification was used 
(22 age-sex groups by province) for each of the methods 
under study. 

Unlike the empirical study based on the SLID, only the
data collected in the first two waves of the NLSCY were
used. There was no simulation and the initial weights were
not normalized ($\sum_k w_{0k} = \hat{N} < N$). It should be noted that
the undercoverage of the NLSCY is around 13% and its
nonresponse is around 8%.

The conclusions drawn from the results presented in
Table 3 are similar to those obtained in the simulation
(Table 1). However, we observe that the relative contribution
of $R_{01}$ to the measure of change is smaller for the
NLSCY than for the SLID. This result indicates that the
nonresponse adjustment of the SLID produces larger individual
changes in the weights, thereby resulting in a larger
contribution by $R_{01}$. Therefore, the nonresponse adjustment
in the case of the NLSCY had no significant effect on the
individual changes in the weights, contrary to what was
observed in the case of the SLID.


The relative contribution of $R_{12}$ to the measure of
change is higher for the NLSCY than for the SLID. This
result indicates that the more refined poststratification of
the NLSCY results in greater individual changes in the
weights, which translates into a greater contribution of $R_{12}$.
Therefore, the NLSCY benefits a great deal from poststratification,
which is less important for the SLID.


Table 3
Value of D, of each Component, and of their Contribution (as a %) to the Measure of Change
for each of the Six Nonresponse Adjustment Methods


Method  D       R_01    R_01/D  R_12    R_12/D  R_int    R_int/D  G       G/D
                        (%)             (%)     (×10⁻⁴)  (%)              (%)
LR_4    0.1475  0.0052  3.51    0.0369  25.05   -4.63    -0.31    0.1058  71.76
LR_14   0.1497  0.0075  5.00    0.0367  24.69   -5.50    -0.37    0.1058  70.68
LR_41   0.1530  0.0112  7.29    0.0369  24.13   -9.16    -0.60    0.1058  69.18
LR_70   0.1564  0.0144  9.21    0.0362  23.13   -0.19    -0.01    0.1058  67.67
SM_19   0.1608  0.0187  11.63   0.0371  23.07   -8.24    -0.51    0.1058  65.81
SM_36   0.1640  0.0220  13.41   0.0373  22.76   -11.30   -0.69    0.1058  64.52
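
The arithmetic behind Table 3 can be checked with a short sketch. It rests on two assumptions that should be read as such: that D is the sum of its four components, and that the interaction column is expressed in units of 10⁻⁴ (the printed exponent is not legible, but this is the scale implied by the percentage column). The dictionary keys are illustrative names only.

```python
# Row LR_4 of Table 3; the interaction term is assumed to be in units of 10^-4.
row = {"D": 0.1475, "R01": 0.0052, "R12": 0.0369, "Rint": -4.63e-4, "G": 0.1058}
components = ("R01", "R12", "Rint", "G")

# Assumed decomposition: D is the sum of the four components.
reconstructed_D = sum(row[name] for name in components)

# Percentage contribution of each component to the measure of change.
shares = {name: 100.0 * row[name] / row["D"] for name in components}
```

Up to the rounding of the published figures, the reconstructed total agrees with the printed D and the shares agree with the printed percentages (about 3.5%, 25%, -0.3% and 72%).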


With respect to $R_{int}$, as with the SLID, its contribution to
the measure of change is negligible. Contrary to the SLID,
the sign of $R_{int}$ is negative, which means that the interaction
between $R_{01}$ and $R_{12}$ is negative.

With respect to G, as in the case of the SLID, it is the key 
source of contribution to the measure of change. In the case 
of the NLSCY, G not only includes the average change in 
weight resulting from the nonresponse adjustment, but also 
the average change in weight resulting from the correction 
for undercoverage through poststratification. 

When all of these results are compared, it becomes
evident that the two surveys are very similar since $R_{int} \approx 0$
and the sum of the contributions to the measure of change
of $R_{01}$ and $R_{12}$ is around 35% in both cases. However, the
NLSCY is also very different from the SLID since $R_{12}$
predominates in the former, while $R_{01}$ predominates in
the latter.

Just as with the SLID, D increases with the number of
RHGs and this measure is greater for the SM than for the
LR. In fact, the value of D is greater for the NLSCY than
for the SLID, mainly because of the NLSCY undercoverage,
which results in an increase in G and, therefore, in D.

The average contribution of $R_{01}$ for the LR and the SM
increases with the number of RHGs, whereas that of $R_{12}$
diminishes (Graph 5). The contribution of $R_{01}$ is also
greater for the SM than for the LR, unlike the contribution
of $R_{12}$, which is smaller for the SM than for the LR.

As was observed with the empirical study, the profile of
the contribution of $R_{01}$ to the measure of change is the
same as that of the measure itself. This shows that the
variations in D depend directly on $R_{01}$.

Graph 6 enables us to compare the LR and the SM,
presenting the average contribution of $R_{01}$ to the measure
of change for the methods with an essentially equivalent
number of RHGs. As with the SLID, the results indicate
that nonresponse seems to be better targeted with the SM
than with the LR method.

Unlike the SLID simulation study, the bias was not
evaluated since no external source of data was available for
evaluation purposes.

Graph 5: Average contribution of $R_{01}$ and $R_{12}$ to the measure of
change (D) for each method

Graph 6: Average contribution of $R_{01}$ to the measure of change (D)
for each method


7. CONCLUSION


This document highlights the fact that the choice of 
RHGs and method for defining them depends on the: i) 
availability of ancillary information, ii) need to reduce the 
nonresponse bias for all estimates, and iii) time and 
operational constraints. The empirical study, as well as the 
NLSCY data, showed that the SM method appears to be 
better than the LR one in reducing the nonresponse bias. 
The results also demonstrated that the proposed measure of 
change can be a very useful tool for comparing different 
weighting strategies. 

In particular, it would appear that, as the value of $R_{01}$
increases, the reduction of the bias obtained from using
RHGs increases. Given the difficulty in obtaining a reliable
estimate of the nonresponse bias in a survey, the relationship
identified between the size of $R_{01}$ and the decrease in
the bias suggests that $R_{01}$ should be used as a tool for
evaluating nonresponse adjustment methods. This requires
that $R_{01}$ first be determined for different RHG sets. Then,
the set with the highest $R_{01}$ value is likely to be more
effective than the other alternatives in reducing the
nonresponse bias for most of the variables of interest.

The measure of change presented could also be used to
compare different calibration strategies. In this case, the
nonresponse adjustment could remain the same for all of the
poststratification methods under study. A detailed study of
the behaviour of $R_{12}$ could be done and would no doubt
lead to certain conclusions, as this study did about $R_{01}$. This
type of study would not necessarily have to be restricted to
the longitudinal context but could quite readily be done
with a cross-sectional study. Also, the measure of change
could be useful in evaluating different nonresponse
adjustment methods in cross-sectional surveys.


Sampling and Weighting a Survey of Homeless
Persons: A French Example


PASCAL ARDILLY and DAVID LE BLANC¹


ABSTRACT 


In 2001, the INSEE conducted a survey to better understand the homeless population. Since there was no survey frame to 
allow direct access to homeless persons, the survey principle involved sampling the services they received and questioning 
the individuals who used those services. Weighting the individual input to the survey proved difficult because a single 
individual could receive several services within the designated reference period. This article shows how it is possible to 
apply the weight sharing method to resolve this problem. In this type of survey, a single variable can produce several 
parameters of interest corresponding to populations varying with time. A set of weights corresponds to each definition of 
parameters. The article focuses, in particular, on “an average day” and “an average week” weight calculation. Information
is also provided on the use data to be collected and the nonresponse adjustment.


1. INTRODUCTION


In 2001, INSEE conducted a survey to better understand 
the homeless population. This was the first representative 
survey of this type in France (A survey of this type was 
conducted in the United States in 1991 by Research 
Triangle Institute (RTI) in the Washington metropolitan 
area (RTI 1993)). The survey principle was to reach 
homeless persons through the services provided to them, 
specifically, overnight accommodation and meals. Obvi- 
ously, a person could use one or more of the services of the 
survey frame during the reference period considered, which 
creates a problem when weighting the survey’s individual 
data files. In this article, we will show how the weight 
sharing method can be applied to this problem. In this type 
of survey, unlike most traditional household surveys, a 
single variable can produce several parameters of interest 
corresponding to different population concepts: the ones 
used most often by practitioners are the “average day” and
“average week” parameters. A set of weights corresponds
to each definition of parameters. We will provide precise 
definitions of these concepts and will focus in particular on 
the practical calculation of the corresponding weights. The 
article is laid out as follows: we will begin by stating the 
objectives of the survey, identifying its reference population 
and describing its sample design. We will then introduce 
the parameters of interest and derive the estimators of these 
parameters using the weight sharing method. We will 
describe the practical application of “average day” and 
“average week” weight calculations. Lastly, we will discuss 
practical considerations related to the nonresponse 
adjustment. 


2. “HOMELESS” SURVEY 
2.1 Objectives of the Survey 


The purpose of the survey conducted by the INSEE in 
February 2001 was to obtain a better understanding of the 
“homeless” population. This population is normally defined 
by default as all persons who do not have a fixed residence. 
It is a population that is not captured by traditional house- 
hold surveys conducted by the Institut since such surveys 
have an accommodation survey frame. Since there was no 
sampling frame for this population, the survey principle 
involved reaching the target population through the services 
provided to persons in difficulty, specifically accommodation
and meals. These services are provided at certain
times that vary depending on their nature: meals are
provided every day at noon and in the evening, while over- 
night accommodation is provided once a day. 

This indirect sampling introduces two discrepancies between the
population initially targeted and the population actually
surveyed. First, the entire target population is not surveyed:
only those members who use the services in the survey field 
are potentially sampled. Second, the population actually 
surveyed contains individuals who do not belong to the 
population initially targeted to the extent that the services 
provided primarily for homeless persons are also used by 
persons who live in a regular household but who are in a 
vulnerable situation (this is especially true in the case of 
meals). Throughout this article, while keeping this
distinction in mind, we will nevertheless sometimes use the
expression “homeless” to designate the persons using the
services in the survey field.


2.2 Reference Population


The main feature of the services surveyed is that they are 
provided in specific locations; this location is accordingly 
called a centre. Several types of services correspond to a 
given centre. The statistical unit sampled, which we will 
call a service, will be defined as a quadruplet (service, day, 
time interval, person): it consists of a given type of service 
in a given centre, on a given day, in a given interval of time, 
to a given person. Of course, a person could receive several 
services on the same day, let alone in a given week or 
during the survey month. 

The survey reference period covers one month
(January 15 to February 15, 2001). The total number of
days in the survey reference period is designated as J, and
the days are denoted by the index j.

The geographic field of the survey is that of population 
centres with more than 20,000 inhabitants. 

The services in the survey field are those of the two
types retained (meals and accommodation) when they are
provided on at least one day during the survey reference
period.

The reference population, designated as P(J), consists
of persons who receive at least one service in the survey 
field during the reference period. 

This population of interest depends fundamentally on the
reference period. Its size increases with the length of that
period, but “more slowly” than the time, since
certain people are found in the centres every day. In reality,
the change in P(J) in relation to J is complex because there
are two separate phenomena coming into play that would
appear to have different characteristic times:


— at any given time, the “homeless population” only 
occasionally visits the centres in the frame: to claim to 
cover that population, it would be necessary to survey 
over a period of time that would ensure that all persons in 
this population had used the services at least once (this 
period is not known but it is acknowledged in France, 
“according to the experts”, that the population not 
covered during one full month of winter is negligible). 


— the “homeless” population is self-renewing over time. 
Year to year, there are no doubt numerous persons 
coming into and going out of this population, linked to 
demographic change or economic or structural changes 
in society (persons coming into and going out of 
vulnerable situations). 


The question of how to determine J ultimately comes 
down to knowing whether interest is mainly in a concept of 
homeless “at a given moment” (J is relatively short) or a 
concept of homeless over a long period of time (J relatively 
long). The approach adopted by the INSEE is a compromise 
between the two. 


2.3 Sample Design 


The survey’s sample design has three stages: selection of 
population centres, selection of centres and time intervals, 
and selection of services. 


2.3.1 Selection of Population Centres 


The first stage of the sample design consists of selecting 
the population centres, based on a size criterion defined as 
a combination of the population of the population centres 
and the ability to provide services so that they could be 
identified in the records of associations and of the Ministère
de la Santé. This first selection stage was carried out several
months before the other two. This screening was necessary 
because the exhaustive census of the centres and the data 
related to them (type of service provided, average capacity, 
days open, ...) was then carried out in the selected popu- 
lation centres. This operation was done twice: a detailed 
survey the year before the data collection and an update just 
before the start of the data collection. This process
produced a survey frame of centres. This frame has a
fundamental role: persons who used only non-identified
centres could not be sampled.


2.3.2 Selection of Centres, Days and Time Intervals 


For practical reasons, it was not possible to survey all of 
the centres and to keep an interviewer on site at a given 
centre the entire day. Nor was it possible to interview 
everyone in a centre. It was therefore imperative to sample: 


— centres in the selected population centres (index c)

— survey days during the collection period (index j)

— intervals of time during the survey days (index t)

— persons within one of the selections (centre, day, time
interval).


For theoretical reasons, time intervals were defined in 
such a way that an individual could not receive two 
different services during a single time interval (for 
example, one of these time intervals was the period from 
11:00 a.m. to 2:00 p.m.). It was not reasonably possible to 
measure the links to the survey frame unless the persons 
interviewed could easily identify in time and space the 
services they received during the survey period. In the case 
of centres offering meals, one time interval covered the 
noon meal and one time interval covered the evening meal. 
It was assumed that an individual could use only one centre 
during the time interval corresponding to the noon meal, 
otherwise it would be necessary to ask the individual if he 
had already received a meal somewhere else or if he had 
eaten twice in the same centre. It was also determined that 
the length of an interval ensuring use of only one service 
was also the length of time that an interviewer could 
reasonably be asked to remain on site interviewing (two to 
three hours maximum). (Note that daytime accommodation
is not part of the services included in the survey field. This
restriction of the field reflects two concerns. First, it would
be very difficult to divide the day into time intervals of
three or four hours and to determine the links using this
breakdown (the memory effort required of the person 
interviewed would be significant and did not seem reason- 
able to the survey’s designers). Second, it is very difficult 
to predict the use of these services. We wanted to avoid 
having a team of interviewers go to a site and not be able to 
conduct any interviews because of lack of use.) 

In actual fact, there is no fundamental difference 
between the sampling of the centres and the sampling of the 
periods of time: the relevant units to be considered are the 
triplets (c,j,t) that correspond to the overlap between a 
centre, a day and a time interval. Some of the boxes in the 
“time” and “centres” cross-tabulation table can be elimi- 
nated automatically prior to the selection, either because the 
centre is closed during the time slot considered, or because 
there is clearly not enough use. (In the latter case, caution 
must be exercised with respect to the possible restriction of 
the field should it be found that persons use only this centre 
and only attend during this time slot. If the latter are 
atypical, biases will be introduced into the estimations.) 

The selection method used was a random selection of the 
triplets (centres, days, time intervals) in proportion to the 
size of the centres obtained during the centre census. (In 
practice, in order to avoid difficulties with centre officials, 
time intervals were grouped together when a centre was 
sampled more than four times during the survey period.) 
Centres were stratified by type. (For accommodation 
services, centres were stratified by the criteria of men 
only/women only/mixed accommodation.) However, since 
this “precautionary” stratification does not apply directly to 
the observation units, it is useful only if the behaviour of 
the individuals differs significantly by the type of centre in 
which they are found. 


2.3.3 Selection of Services 


This last stage of the sample design consisted in 
completing the sampling of services, that is, in selecting 
individuals in a selected centre on a given day during a 
given time interval. The data collected during the census of 
the centres were not generally enough to constitute a survey 
frame of services. Some accommodation centres had lists: 
this was the more positive scenario where persons could be 
selected using these lists. However, at the majority of 
centres (for example, a soup kitchen), it was not even 
known how many people would show up in a given time 
interval: it was therefore not possible to develop a survey 
frame of services. Sampling of the services was done on an 
equal probabilities basis. As is traditional in multiple stage 
surveys, selecting a constant number of services (last stage) 
ensures constant probabilities of selection and thereby 
limits the risk of expanding sampling variances. 

In practice, the selection method used varies from one
type of centre to the next, depending on the topography of
the sites: existing list, waiting list, arrivals spaced over time,
population “grouped” in no order at a single site at the
same time, etc. It also takes into consideration the
maximum number of interviews that can reasonably be
done by the interviewer or interviewers during the survey’s
time interval, and the fact that it is not desirable to keep the
sampled persons too long after the closing of a centre or
after meal service has stopped because of the risk of
increasing the nonresponse rate.

In all instances, a “counter” counts the number N of
services provided during the sampling period. This count is
crucial to determining the selection probability of the
sampled services. At the same time, the counter carries out
a standard systematic selection. (Ideally, the selection should
be done by another person, or “sampler”, to avoid errors in
measuring use; for budget reasons, this was not possible.)
The selection used the following
method:


— in centres where a list was available, n services were 
selected, n being set before the survey; 


— in centres without a list, services were selected with a
fixed sampling ratio f. The ratio f is determined based on the
number of expected services N and the number of
services n that we wanted to sample, in order to ensure
equal selection probabilities. In these cases, the size of
the sample was not known in advance.
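
The fixed-ratio case can be illustrated with a minimal sketch of systematic selection with a random start. It is not the procedure actually applied in the field: the function names, the seed and the figures (200 expected services, 20 interviews wanted, 130 services finally counted) are hypothetical.

```python
import random


def sampling_step(expected_services, target_sample_size):
    """Return the systematic step 1/f, where f = n / N_expected."""
    if target_sample_size <= 0 or expected_services < target_sample_size:
        raise ValueError("need 0 < target_sample_size <= expected_services")
    return expected_services / target_sample_size


def systematic_ranks(counted_services, step, rng):
    """Select service ranks (1-based, in order of arrival) systematically.

    A random start is drawn in [0, step); each rank is then selected with
    probability f = 1 / step, whatever the final count turns out to be.
    """
    position = rng.random() * step
    ranks = []
    while position < counted_services:
        ranks.append(int(position) + 1)
        position += step
    return ranks


# Hypothetical figures: 200 services expected and 20 interviews wanted, so f = 0.1.
step = sampling_step(expected_services=200, target_sample_size=20)
rng = random.Random(2001)  # fixed seed so that the sketch is reproducible
selected = systematic_ranks(counted_services=130, step=step, rng=rng)

# Each sampled service represents 1/f = step services of its (centre, day, interval).
service_weight = step
```

With only 130 services counted instead of the 200 expected, 13 services are selected rather than 20: the selection probability stays equal to f, but the sample size is known only once the count is complete.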


3. PARAMETERS OF INTEREST 


The quantities of interest are essentially totals or ratios.
We want to estimate a total in relation to a variable y
defined for the population P(J),


$$Y = \sum_{k \in P(J)} y_k. \qquad (1)$$


One specific example of these totals is the size of P(J),
$N_J = \mathrm{card}(P(J)) = \sum_{k \in P(J)} 1$.

We also want to estimate the average of y in the
reference population,


$$\bar{Y} = \frac{Y}{N_J}. \qquad (2)$$


For example, y can be the nationality of the individual, 
the age at which he completed his education, or the number 
of centres that he visited the day of the interview. 

We then have to distinguish between two types of 
variables: 


— variables that are fixed during the survey reference 
period (such as, age at time of completion of education); 


— variables that vary during the survey reference period
($y_k = y_k(j)$). The number of centres visited on the day of
the survey falls into this category.




We will begin with the variables that are fixed during the 
survey reference period. Section 6 looks briefly at those 
variables that change during that period. 


4. ESTIMATION OF A TOTAL OR RATIO IN 
CASES WHERE THE VARIABLE OF 
INTEREST IS CONSTANT DURING 
THE SURVEY PERIOD 


For the convenience of the discussion, we will not 
present explicitly all of the selection stages. Instead, we will 
use as an example a population centre sampled at the first 
selection stage. 


We note: 


$C$: all centres in the population centre open at least one 
day during the survey period, denoted by index c. 

$\Pi_{c,j,t}$: all services provided in centre c on day j 
during time interval t, denoted by index i. 

$\Pi_{\cdot,j,t}$: all services provided in the population 
centre on day j during time interval t. 

$P_{c,j,t}$: all persons who visit centre c on day j during 
time interval t, denoted by index k. 

$P_{\cdot,j,t}$: all persons who visit a centre in the 
population centre on day j during time interval t. 


Based on the definition of the time intervals, we find that 
for each individual $k \in P_{\cdot,j,t}$, there is one and only one 
service i. Thus, there is a one-to-one correspondence 
between $P_{\cdot,j,t}$ and $\Pi_{\cdot,j,t}$. In other words, for every 
couple (j, t), the $P_{c,j,t}$ are disjoint. On the other hand, 
$P_{c,j,t}$ and $P_{c',j,t'}$ can have a non-empty intersection 
when $t \neq t'$. 

The population of interest is therefore written 

$$P(J) = \bigcup_{c,j,t} P_{c,j,t} = \bigcup_{j,t} \bigcup_{c \in C} P_{c,j,t}$$


The central point of the reasoning consists in expressing 
the total of one variable over the population of individuals 
(which is our total of interest) as the total of another 
variable over the population of services (which are the 
sampled units), since estimation of the latter does not pose 
any particular problem. To obtain this result, we can use 
direct reasoning or apply the weight sharing method, 
whichever seems more natural. 

Using direct reasoning, we define the mapping K, which 
associates with each service i received during reference 
period J in any of the centres in the survey frame the 
individual who received that service: 

$$K : \{\text{services}\} \to \{\text{individuals}\}, \qquad i \mapsto K(i).$$

The population of interest P(J) is the image under K of 
$\Pi(J)$, the set of all services provided during the reference 
period in all centres in the survey field. For each 
$k \in P(J)$, we define $r_k(J) = \mathrm{card}(K^{-1}(k))$, the number of 
services provided to individual k during period J in all 
centres in the survey field, which we will also call the 
“number of links”. 

This gives us the fundamental equation: 

$$Y_J = \sum_{k \in P(J)} y_k = \sum_{i \in \Pi(J)} \frac{y_{K(i)}}{r_{K(i)}(J)} \qquad (3)$$

Since variable y takes the same value for all services i 
“pointing” to individual k, that is, such that K(i) = k, the 
right-hand side can be written 

$$\sum_{k \in P(J)} \sum_{i \in \Pi(J),\, K(i)=k} \frac{y_k}{r_k(J)} = \sum_{k \in P(J)} \frac{y_k}{r_k(J)} \left[ \sum_{i \in \Pi(J),\, K(i)=k} 1 \right]$$

But the quantity in the square brackets is the number of 
services provided to individual k during period J, or $r_k(J)$, 
which proves the equation. 

We can then see $y_{K(i)}$ as attached to the corresponding 
service i and write $y_i$ in place of $y_{K(i)}$ and $r_i(J)$ in 
place of $r_{K(i)}(J)$. By using $z_i = y_i / r_i(J)$, we get 

$$\sum_{k \in P(J)} y_k = \sum_{i \in \Pi(J)} z_i.$$


Formula (3) is none other than the weight sharing 
formula. The above reasoning is actually the reasoning 
underlying this method. (Only the formulation changes: the 
weight sharing method describes the links between the 
sampled population and the population of interest by a 
matrix rather than a mapping, so that a single unit of the 
sampled population can “point” to several units of the 
population of interest.) The principle of this latter 
method is set out in Appendix 1. 


4.1 Estimation of a Total 


Let us now assume that we have a sample $s_\Pi$ of services 
with an associated set of weights $(w_i)_{i \in s_\Pi}$. We assume 
these weights are unbiased (they are the inverses of the 
probabilities of inclusion of services in the sample). $s_\Pi$ 
implicitly defines a sample of individuals $s_P$, which is 
the set of all individuals who receive the sampled 
services. The weight sharing formula (see Appendix 1) 
ensures that the estimator 

$$\hat{Y}_J = \sum_{k \in s_P} y_k W_k$$

is unbiased, where we write for every $k \in s_P$: 

$$W_k = \frac{1}{r_k(J)} \sum_{i \in s_\Pi,\, K(i)=k} w_i \qquad (4)$$

Formula (4) simply states that an individual’s weight is 
equal to the sum of the weights of the services that were 
used to “catch” him, divided by the number of links with 
the survey frame, $r_k(J)$. In this way, it is possible to work 
directly on the individuals sampled: for each individual k, 
we calculate the weight $W_k$, and we estimate the total $Y_J$ 
by $\hat{Y}_J$. 

Figure 1 gives a fictitious sampling example. The service 
universe contains 13 services, provided to 8 persons. 6 
services are sampled. The sample of individuals contains 5 
persons, individual number 2 having been “caught” by two 
different services. Using formula (4), the weights of the 
individuals sampled will be equal to: 


xs é, 1 ye = - 1 
Paps re + Ws), Ws = Wig, We = W., We = 3 Wo- 




Figure 1. The arrows represent the links between the services and the 
individuals. The shaded services were sampled. They point 
to shaded individuals. Dotted lines represent the links 
reported by individual 7, which were not used to include 
the individual in the sample. 
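
Formula (4) and the resulting estimator of a total can be sketched in a few lines. This is a minimal illustration, not the survey's production code: the function names `share_weights` and `estimate_total`, the input layout and the data are all hypothetical.

```python
from collections import defaultdict

def share_weights(sampled_services, links):
    """Individual weights W_k from service weights w_i (formula (4)).

    sampled_services: list of (person_id, service_weight) pairs, one
        per sampled service i, where person_id stands for K(i).
    links: dict person_id -> r_k(J), the number of services received
        by the person in the survey frame during the reference period.
    """
    totals = defaultdict(float)
    for person, w in sampled_services:
        totals[person] += w  # sum of w_i over sampled i with K(i) = k
    return {k: total / links[k] for k, total in totals.items()}

def estimate_total(weights, y):
    """Estimator of the total: sum of y_k * W_k over sampled persons."""
    return sum(y[k] * w for k, w in weights.items())

# Hypothetical data: person "a" is caught by two sampled services.
sampled = [("a", 2.0), ("a", 3.0), ("b", 4.0)]
links = {"a": 5, "b": 1}
W = share_weights(sampled, links)            # a: (2 + 3) / 5, b: 4 / 1
n_hat = estimate_total(W, {"a": 1, "b": 1})  # y = 1 estimates the size
```

Taking y = 1 for every person gives the estimated population size as the sum of the shared weights.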


If the services all have the same weight equal to 13/6 (for 
example, if the services had been selected by simple 
random sampling), the number of persons having used 
services during the survey is estimated by the sum of the 
individual weights, $\hat{N}_J = \sum_{k \in s_P} W_k$. 

In this case, where the variable being considered does not 
vary during the survey period, whether or not the persons 
using the services are identified does not affect the bias 
of the estimator. Consider an individual k “caught” by two 
different services with weights $w_1$ and $w_2$. In practice, 
this could produce two cases: 


- it is determined that this is the same individual; the 
weight associated with this individual will be equal to 
$(w_1 + w_2)/r_k(J)$, and the term corresponding to the 
individual in the estimator will be equal to 
$y_k (w_1 + w_2)/r_k(J)$; 


- it is not determined that the individual has already been 
interviewed: two different individuals are counted; the 
weights associated with these individuals will be equal to 
$w_1/r_k(J)$ and $w_2/r_k(J)$, and the term corresponding 
to these two pseudo-individuals in the estimator will still 
be equal to $y_k (w_1 + w_2)/r_k(J)$. 


Of course, this presumes that the information provided 
by the same person surveyed in two different locations/on 
two different days is the same, which is far from given. 

However, identifying individuals can be important in 
order to limit nonresponse (see section 7). 


4.2 Estimation of a Ratio 


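Let us now suppose that we are interested in the 
estimation of the average $\bar{Y}_J$ (see Formula (2)). $\bar{Y}_J$ can be 
estimated by the Hájek estimator, 

$$\hat{\bar{Y}}_J = \frac{\hat{Y}_J}{\hat{N}_J}$$

where $\hat{Y}_J$ is the estimator of the total given in Section 4.1 
and $\hat{N}_J = \sum_{k \in s_P} W_k$. 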
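
As a small sketch of this ratio estimator, assuming the shared weights have already been computed (the function name `hajek_mean` and the numbers are hypothetical):

```python
def hajek_mean(weights, y):
    """Hajek estimator: weighted total of y over the sum of weights.

    weights: dict person_id -> shared weight W_k.
    y: dict person_id -> value of the variable of interest.
    """
    n_hat = sum(weights.values())  # estimated population size
    if n_hat == 0:
        raise ValueError("sum of weights is zero")
    y_hat = sum(y[k] * w for k, w in weights.items())
    return y_hat / n_hat

W = {"a": 1.0, "b": 4.0}     # hypothetical shared weights
y = {"a": 10.0, "b": 20.0}   # hypothetical values of y
mean_hat = hajek_mean(W, y)  # (10 * 1 + 20 * 4) / (1 + 4)
```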


4.3 Variance Calculation 


The variance of the estimators presented above is 
calculated in the classic manner provided that the reasoning 
is based on services. The calculation is still complex 
because it is a multi-stage design with unequal probabilities. 
To avoid underestimating the true variance, it is essential 
that all services be retained in cases where several sampled 
services point to a single individual. 


4.4 Comparison with Other Estimating Methods 


Having introduced “weight sharing” estimators, it is 
appropriate to consider an alternative estimating method 
in which we would try to estimate directly the selection 
probabilities of individuals in the sample. (The weight 
sharing estimator is not a classic Horvitz-Thompson 
estimator: its weights clearly depend on the complete 
service sample; see formula (4).) This alternative can 
appear more natural. However, we must make two comments: 

- it is not reasonably possible to obtain the selection 
probabilities of physical persons without relying on the 
services that the individual receives, based on the 
information provided by the latter when visiting the various 
centres. Based on the previous expressions, we get: 

$$\mathrm{Prob}(k \in s_P) = \mathrm{Prob}\Big( \bigcup_{i \in \Pi(J);\, K(i)=k} \{ i \in s_\Pi \} \Big)$$


The Poincaré formula enables us to express this 
probability in terms of the single, double, triple, etc. 
inclusion probabilities of services. Except for the single 
inclusion probabilities, these are complex probabilities, 
since they result from unequal-probability selections 
without replacement. We cannot therefore hope to obtain a 
calculable expression for $\mathrm{Prob}(k \in s_P)$. In contrast, 
the weight sharing method is very simple to apply; 


- more fundamentally, a problem comes from the fact that 
the selection probabilities of unsampled services are not 
known in advance because of the multi-stage sample: at each 
stage, the selection probabilities depend on the previous 
selection. In our case, we do not know the use made of the 
centres that are not surveyed. To obtain the selection 
probability of an individual, we must know the inclusion 
probabilities of all services that the individual receives. 
On the other hand, one of the strengths of the weight 
sharing method is that the weights of units obtained 
indirectly (in this case individuals) depend only on the 
weights of units sampled directly (services). Lavallée 
(1995) points out this advantage of the method. 
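
To make the first comment concrete, the Poincaré (inclusion-exclusion) formula can be written out for an individual k. The notation is assumed here rather than taken from the text: $i_1, \dots, i_r$ are the $r = r_k(J)$ services with $K(i) = k$, and $\pi_{i i' \cdots}$ denotes the probability that the services $i, i', \dots$ are all included in the sample of services.

```latex
\mathrm{Prob}(k \in s_P)
  = \sum_{a} \pi_{i_a}
  - \sum_{a < b} \pi_{i_a i_b}
  + \sum_{a < b < c} \pi_{i_a i_b i_c}
  - \cdots
  + (-1)^{r+1} \, \pi_{i_1 i_2 \cdots i_r}
```

With r = 2 this reduces to $\pi_{i_1} + \pi_{i_2} - \pi_{i_1 i_2}$, which already requires a joint inclusion probability, whereas formula (4) needs only the weights of the services actually sampled.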


5. ESTIMATION DIFFICULTIES AND 
PRACTICAL SOLUTIONS IN THE 
CASE OF A CONSTANT VARIABLE 


In the formulae that we have presented, knowing the 
links between individuals and the services universe is 
critical. However, these quantities are not known for several 
reasons: 


- a theoretical reason: the data collection is spread over 
time, and an individual interviewed at the start of the 
period cannot anticipate the services that he will use 
after the interview date (data collection must necessarily 
be spread over time to ensure good coverage of the target 
population; synchronous collection, even if technically 
possible, would not capture the whole target population but 
only the persons using the services on that date); 


- practical reasons: the memory of the person interviewed 
becomes unreliable after a few days, and it is very 
difficult for the interviewer or the survey designer to 
detect services provided in centres not belonging to the 
survey frame. 


In practice, it is therefore impossible to estimate without 
bias a total of interest over the period of the survey (one 
month) without making assumptions at the outset (see 
Section 5.3). 


5.1 “Average Day” and “Average Week” Estimations 


This forces us to look at quantities that bring into play 
links over a short period, for example, a day or week. The 
population of persons who use the services in the survey 
field on a given day j is $P_j = \bigcup_{c \in C,\, t} P_{c,j,t}$. Let us now 
introduce the following quantities that relate to day j: 

$$\Theta_j = \sum_{k \in P_j} y_k, \qquad N_j = \sum_{k \in P_j} 1 = \mathrm{card}(P_j).$$

If $\tau = \mathrm{card}(J)$ is the number of days in the survey 
reference period, we define the following parameters of 
interest: 


- the total of y in the population of persons who use the 
services in the survey field on an “average” day, as 
follows: 

$$\Theta = \frac{1}{\tau} \sum_{j=1}^{\tau} \Theta_j \qquad (5)$$

A specific case is the number of persons who use the 
services in the survey field on an “average” day, 
$N = \frac{1}{\tau} \sum_{j=1}^{\tau} N_j$. 

In the same way, the average of y in the population of 
persons who use the services in the survey field on an 
“average” day is defined as: 

$$\bar{y} = \frac{\Theta}{N} \qquad (6)$$


Defining totals or averages for a given week or an 
“average week” follows the same principle. 

We can estimate these parameters by simply adapting the 
formulae in the previous section, noting that the r, (J) must 
be replaced by the number of services in the survey field 
that the person sampled received on the day (or week) of 
the survey. 

Denote by $s_j$ the sample of persons interviewed on day j, 
by $r_k(j)$ the number of services in the universe received by 
individual k on day j only, and by $s_k(j)$ the set of services 
sampled on day j that link to individual k. 

$\Theta_j$ will be estimated by 

$$\hat{\Theta}_j = \sum_{k \in s_j} y_k W_k(j), \qquad \text{where } W_k(j) = \frac{1}{r_k(j)} \sum_{i \in s_k(j)} w_i.$$

Here, the weights of the individuals depend on the day j. 
(The weights of the services, $w_i$, do not: they are set once 
and for all; if there were no nonresponse, they would be the 
inverses of the selection probabilities of services.) The 
following analogy helps to clarify the difference between 
$\Theta$ and $Y_J$. Consider a service window where everyone who 
comes must fill out a file. $Y_J$ corresponds to an approach 
where the person fills out a file the first time that he 
arrives at the window and does not fill one out on 
subsequent visits; the “average day” case corresponds to an 
approach where everyone who arrives at the window fills out 
a file, regardless of whether he has come to the window on 
some other day or not. At the end of a week, for example, 
the analysis of the characteristics of the persons who 
filled out the files will be very different in the two 
cases: in the second case, persons who come to the window 
often will be over-represented compared to the first case. 
It is possible to formalize this approach. We refer 
interested readers to Ardilly and Le Blanc (1999). 
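
The “average day” estimator can be sketched by sharing each service weight over the links of the day on which the service was sampled, then averaging over the days of the period. The function name `average_day_total`, the data layout and the figures below are hypothetical.

```python
from collections import defaultdict

def average_day_total(sampled_services, daily_links, y, n_days):
    """Estimate the 'average day' total of y.

    sampled_services: list of (day, person_id, service_weight) triples.
    daily_links: dict (person_id, day) -> number of services received
        by the person in the survey field on that day.
    y: dict person_id -> value of the variable of interest.
    n_days: number of days in the reference period.
    """
    shared = defaultdict(float)
    for day, person, w in sampled_services:
        # Each sampled service contributes its weight divided by the
        # person's number of links on that day.
        shared[(person, day)] += w / daily_links[(person, day)]
    daily_sum = sum(y[person] * w for (person, day), w in shared.items())
    return daily_sum / n_days

sampled = [(1, "a", 6.0), (1, "b", 6.0), (2, "a", 6.0)]
daily_links = {("a", 1): 2, ("b", 1): 1, ("a", 2): 3}
y = {"a": 1, "b": 1}
n_avg = average_day_total(sampled, daily_links, y, n_days=2)
```

In this toy example person "a" is counted on both days, which is exactly the over-representation of frequent users described by the service window analogy.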


5.2 Practical Estimation of the Links with the 
Survey Frame 


Even if we restrict ourselves to estimating “average 
week” and “average day” quantities, it is not generally 
possible to determine the links with the survey frame on a 
given day (much less a given week or over the whole of the 
survey period). 


5.2.1 “Average Day” Estimation 


To share the weights, we must estimate the links relating 
to the survey day. The situation that presents the most 
problems is that of persons interviewed at noon in a centre 
that provides meals: we do not know which centres (meals 
and/or accommodation) these persons will use that same 
evening. One option, not retained by the INSEE survey 
designers, is to include in the questionnaire questions of 
the type “Where will you eat (or sleep) this evening?”. The 
answers can be used to determine the links. Of course, the 
issue is whether the answers to these questions reflect the 
true links and whether the nonresponse rate for the question 
would be too high. From a more statistical standpoint, and 
assuming a certain regularity of behaviour, we could use 
information relating to the same time interval on the day 
before the survey. The corresponding links are undoubtedly 
reasonable approximations of the actual links. The practical 
problem relates to the possible difference in use of the 
centres depending on the day of the week: for example, some 
centres are not open on weekends and others are open only on 
specific days. 


5.2.2 “Average Week” Estimation 


To share the weights, we retain all the links relating to 
the week. Clearly, the first option described in 5.2.1 cannot 
be used. For estimates relating to a given week, we can use, 
as an approximation of the services used on a day j 
following the interview date, the services used by the 
individual on day (j - 7). This is consistent if we assume 
that there is a certain pattern to the services used 
depending on the day of the week. This approach means that 
the calendar week is replaced in the estimators by a sliding 
week, that is, the seven days counted back from the date of 
the interview. This is the option that was used for the 
survey, the questionnaire having been designed to collect 
the links over the 7 days preceding the interview. 
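
Counting the links over the sliding week can be sketched as below. The exact window is an assumption: here it is taken as the interview day and the six days before it, so that the sampled service itself is always counted; the function name and the dates are hypothetical.

```python
from datetime import date, timedelta

def sliding_week_links(interview_date, service_dates):
    """Number of links over the sliding week ending on the interview day.

    service_dates: dates of the services the person reports having used
    in the survey field (one entry per service).
    """
    start = interview_date - timedelta(days=6)  # assumed 7-day window
    return sum(1 for d in service_dates if start <= d <= interview_date)

interview = date(2001, 2, 15)
reported = [date(2001, 2, 15), date(2001, 2, 14), date(2001, 2, 10),
            date(2001, 2, 9), date(2001, 2, 8)]
n_links = sliding_week_links(interview, reported)  # 8 February is outside
```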


5.3 Estimation Over the Whole of the Survey Period 


It may seem that estimating totals and averages for the 
population P(J) is one of the survey’s objectives. This 
estimation calls on the links between individuals sampled 
and the services in the survey field during the whole of the 
data collection period, which are not known. This means 
that we have to model the evolution of the links beyond a 
week or, what amounts to the same thing, model the use 
behaviour of the individuals in the centres. 

The solution is not simple. For example, the hypothesis 
that comes to mind, 

$$\forall k, \quad r_k(J) = A \cdot r_k(S) \qquad (7)$$

where A is the number of weeks of the survey and $r_k(S)$ is 
the number of links for individual k with the services of the 
survey field during a week S, leads to estimators for the 
whole of the period that are identical to the estimators for 
an average week. In effect, an “average week” estimator 
weights individual k by 

$$\sum_{i \in s_k(J)} \frac{w_i}{A \cdot r_k(S_i)}$$

where $S_i$ is the week during which he received service i and 
$s_k(J)$ is the sample of services that link to individual k, 
whereas a theoretical “whole period” estimator weights 
individual k by 

$$\sum_{i \in s_k(J)} \frac{w_i}{r_k(J)}$$

Equation (7) is therefore a sufficient condition for the 
equality of these estimators. This condition is satisfied in 
particular when for any j and any k 

$$r_k(J) = \mathrm{card}(J) \cdot r_k(j) \qquad (8)$$

that is, when the number of daily links does not depend on 
j. 

This hypothesis is definitely too strong. To expand on 
this point, we will have to use the data provided by the 
survey itself on the behaviour of the individuals with 
respect to use of the centres. 
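
The step from hypothesis (7) to the equality of the two weights can be written out explicitly. This is a worked check under hypothesis (7) only, using the notation of this section.

```latex
r_k(S_i) = \frac{r_k(J)}{A} \quad \text{for every } i \in s_k(J)
\;\Longrightarrow\;
\sum_{i \in s_k(J)} \frac{w_i}{A \, r_k(S_i)}
  = \sum_{i \in s_k(J)} \frac{w_i}{A \cdot r_k(J) / A}
  = \sum_{i \in s_k(J)} \frac{w_i}{r_k(J)}
```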

The most sought-after figure of the survey, in the French 
context, is undoubtedly an estimate of the size of the 
“homeless” population, that is, an estimate of the size of 
P(J). In addition to the issues regarding counting the 
links that have already been discussed extensively, this 
estimation runs up against several inadequacies in the 
survey frame as well as the indirect nature of the sampling. 

- The risk of overlooking certain structures when 
identifying the centres is significant. Even with an 
exhaustive inventory, the gap between the time the inventory 
is established and the time the survey takes place makes it 
likely that new, unidentified centres will have appeared in 
the survey field. This can introduce a bias to the extent 
that some individuals who use these structures would not use 
any other service in the survey frame. (We might also expect 
those in charge of certain centres to refuse to cooperate. 
For the INSEE survey, there was virtually no refusal by the 
institutions: the refusal rate was less than 1%. This was 
due largely to considerable awareness building at the time 
the centres were identified and just before the survey.) 
Further, the lack of bias depends on a correct calculation 
of the links; use of centres not included in the frame 
should not be counted in these links. 


- Individuals who use the centres only outside the “classic” 
hours (those in which we have the means to count the 
services) are outside the survey frame. (Counting them 
would create significant on-site implementation problems.) 


- Another source of bias can come from errors in counting 
the total number of services provided in the centres during 
the survey, these numbers being used to calculate the 
probability of a service being sampled. For budget reasons, 
a single person both counted the services and did the 
sampling, a situation that could compromise the rigour of 
the sampling if there is confusion in the field. 


- In terms of the concepts, the only remaining problem was 
that the survey had to take place over a month and that the 
target population may have changed during that period. 


The estimation of the size of the population is therefore 
particularly fragile. For this reason, we can expect any 
errors to be larger for the totals than for the averages. 


6. ESTIMATION IN THE CASE OF VARIABLES 
OF INTEREST THAT ARE NOT CONSTANT 
OVER THE SURVEY PERIOD 


Some of the survey’s variables of interest depend on the 
observation date and therefore are not constant over the 
survey period. This can be the case with answers to 
questions dealing with the day or week before the interview, 
for example “How many meals did you have yesterday?”, 
“How many times did you sleep in the street last week?”, 
etc. The questions on links also fall into this category. It 
is therefore important to determine the extent to which we 
can adapt the earlier formalism to estimations in which y is 
such a variable of interest. 

If we go back to expression (3), it is easy to see that the 
constancy of $y_k$ during the survey period is the condition 
that makes it possible to factor out $y_k$ and to reveal the 
links $r_k(J)$. From this we can deduce that the above type of 
calculation remains valid for estimations covering periods 
no longer than the period for which the $y_k$ are constant. 

This means that for variables that are constant for a day, 
we can appropriately use the “average day” estimators. For 
variables that are constant over the week, we can use the 
“average day” or “average week” estimators. 


7. ADJUSTMENT FOR TOTAL 
NONRESPONSE 


To describe the operation fully, we still need to explain 
how to move from a set of inclusion probabilities (and thus 
initial weights of services included in the sample) to a set of 
weights on respondent services. Some people will agree to 
the interview, others will not. We will refer to services in 
the first case as respondent services and those in the second 
case as nonrespondent services. The usual adjustment 
methods for total nonresponse can be applied. We suggest 
a nonresponse adjustment by homogeneous subgroup (for 
a description of the method, see for example Hambaz and 
Legendre 1999). 

In reality, the main problem relates to the fact that there 
is no survey frame of individuals and thus no advance 
information on nonrespondents. In a world that is likely 
very heterogeneous, this is a considerable handicap. We 
therefore have to model the service response behaviour. We 
know from the test surveys of the INED (Institut National 
des Etudes Démographiques) that nonresponse varies 
widely depending on the type of centre (Firdion and 
Marpsat 1997). Other variables in the survey frame can be 
used to build homogeneous groups (day of the week, period 
of the day, groups of population centres, ...). 

A reweighting of the respondent services produces 
weights of the type $w_i = 1/(\delta_i \pi_i)$, where 

$\pi_i$ is the probability of inclusion of service i in the 
sample; 

$\delta_i$ is the probability, estimated after the fact, that 
service i will result in a response. 

This provides us with a set of weights for the respondent 
services. 
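
A nonresponse adjustment by homogeneous subgroup can be sketched as follows. The estimator of the response probability is an assumption made for the illustration (the weighted response rate within each group; an unweighted rate is another common choice), and the function name, the group labels and the data are hypothetical.

```python
from collections import defaultdict

def adjust_for_nonresponse(services):
    """Reweight respondent services within homogeneous response groups.

    services: list of dicts with keys 'pi' (inclusion probability),
    'group' (response homogeneity group, e.g. type of centre) and
    'responded' (bool).
    Returns a list of (service, adjusted_weight) pairs for respondents,
    with adjusted_weight = 1 / (delta * pi), delta being the weighted
    response rate of the service's group.
    """
    sampled_w = defaultdict(float)
    respondent_w = defaultdict(float)
    for s in services:
        w = 1.0 / s["pi"]
        sampled_w[s["group"]] += w
        if s["responded"]:
            respondent_w[s["group"]] += w
    adjusted = []
    for s in services:
        if s["responded"]:
            delta = respondent_w[s["group"]] / sampled_w[s["group"]]
            adjusted.append((s, 1.0 / (delta * s["pi"])))
    return adjusted

services = [
    {"pi": 0.25, "group": "meal", "responded": True},
    {"pi": 0.25, "group": "meal", "responded": True},
    {"pi": 0.25, "group": "meal", "responded": False},
    {"pi": 0.25, "group": "meal", "responded": False},
    {"pi": 0.5, "group": "shelter", "responded": True},
    {"pi": 0.5, "group": "shelter", "responded": True},
]
adjusted = adjust_for_nonresponse(services)
weights = [w for _, w in adjusted]
```

With this choice, the adjusted weights of the respondents in a group add up to the initial weights of all services sampled in that group.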

In fact, some of the nonresponses come from the fact that 
the same individual is sampled several times: obviously, an 
individual who is sampled twice might respond the first 
time but not the second. (The frequency of occurrence of 
this event was not known at the time of writing this paper.) 
The second selection therefore produces a “false 
nonresponse”. If this is not detected, the total nonresponse 
adjustment procedure leads to an incorrect reweighting, 
whereas the true value can be obtained from a questionnaire 
that has already been completed. To avoid this problem, the 
interviewer tries to find out the reason for the refusal and 
must check off a specific box when the individual states 
that he has already been interviewed. In this situation, the 
interviewer collects some information, including the first 
name and the date of birth, that can be used to link this 
questionnaire to the questionnaire that has already been 
completed. (The ideal situation would be to have an 
identifier for the respondents. This approach was not used 
because of confidentiality requirements and out of 
consideration for the reaction of the persons interviewed to 
such a measure.) However, in the field, it can be difficult 
to obtain a reason for refusal. Even if a reason is given, 
problems can occur. (It is hard to verify that a person who 
states that he or she has already been interviewed has in 
fact been interviewed. Even if the person is acting in good 
faith, he may have been interviewed a few days earlier for a 
completely different survey than the INSEE survey.) 


8. CONCLUSION 


In this article, we show how the weight sharing method 
can be used to weight the survey conducted by the INSEE 
in order to better understand homeless persons. The method 
has many advantages. It makes it possible to work on a file 
of individuals, that is, on the natural statistical units used in 
the definition of the parameters of interest. Simple to apply, 
it also makes it easy to move from one reference period to 
another (“average day”, “average week” estimation). 
Operations following the survey, such as the nonresponse 
adjustment and the calculation of variance, can be carried 
out in a traditional framework because they are done on 
sampled units (services), for which the selection 
probabilities are known, and not on individuals, for which 
the selection probabilities are not known. We show that a 
crucial quality criterion of such a survey is reliable data 
collection on use of services by the persons interviewed. 
Without these data, it is not possible to weight the survey. 
The weight sharing method appears to be a good compromise 
for a survey in which the purpose is not simply to 
count a population but to better understand it through the 
use of a questionnaire. Alternative methodologies 
could be used for a survey aimed simply at determining the 
size of the homeless population. The first such methodology 
uses the capture-recapture techniques employed to determine 
the size of animal populations (see for example, Pollock, 
Turner and Brown 1994). These techniques cannot be easily 
applied to a population that is often suspicious of any 
attempt to identify it, which it perceives negatively. 
Another technique is that of “snowball” sampling, which 
involves finding individuals of interest through the 
intermediary of individuals already sampled (Franck and 
Snijders 1994). It relies on a system of mutual knowledge 
among persons, which is probably illusory in this community. 
These methods always run up against the issue of identifying 
individuals. In our case, the only places where it is 
possible to find the persons we are seeking are the centres: 
it is essential that we work through the centres. 

APPENDIX 1: 
THE WEIGHT SHARING METHOD 
APPLIED TO THE PROBLEM 


This appendix briefly presents the principle of the weight 
sharing method. For a more complete discussion, the reader 
may consult Lavallée (1995) or Deville (1999) whose 
notations we have used. 


1. We have a population U of n units, and a population V of 
m units. The units of U are services in the survey field. 
The units of V are persons who used at least one service 
during the survey period (in other words, V = P(J) in the 
notation used above). 


2. It is assumed that there are links between the units of the 
two populations. These links can be written in the form 
of a matrix 

$$(r_{ik})_{1 \le i \le n,\; 1 \le k \le m}$$

where $r_{ik} = 1$ if unit k of V is linked to unit i of U, and 
$r_{ik} = 0$ otherwise. In this case, the links connect the 
services to the persons who used these services: $r_{ik} = 1$ if 
person k used service i of U, $r_{ik} = 0$ otherwise. 


3. All units of V have at least one link to a unit of U. 
Clearly, that is achieved here by definition of population 
V. Further, in this case, each unit of population U points 
to one and only one unit of V. 


In general, we are interested in the total of a variable of 
interest y in V, 

$$Y = \sum_{k \in V} y_k.$$

If, for example, we use y = 1, the total of interest is the 
number of persons who used a service in the survey field 
during the month of the survey. 

We can write 

$$r_k = \sum_{i \in U} r_{ik}.$$

The identity ($\forall k \in V$, $\sum_{i \in U} r_{ik}/r_k = 1$) makes it 
possible to define for any $i \in U$ the variable 
$z_i = \sum_{k \in V} (r_{ik}/r_k)\, y_k$, which gives: 

$$Z = \sum_{i \in U} z_i = \sum_{k \in V} y_k = Y.$$

Let us now assume that we have a sample $s_U$ from the 
population U, which is associated with a set of weights 
$(w_i)_{i \in s_U}$. This sample implicitly defines a sample in V, 
$s_V$, specifically 

$$s_V = \{ k \in V : \exists\, i \in s_U,\ r_{ik} = 1 \}.$$

We assume that we collected the $r_k$ for all $k \in s_V$, that is, 
that all links between individuals and the universe U are 
known (this point is fundamental). 

The total Z = Y is estimated by $\hat{Z} = \sum_{i \in s_U} w_i z_i$. 

Consequently, if the weights are unbiased (that is, 
set so that $\hat{Z}$ is without bias), $\hat{Y} = \hat{Z}$ estimates Y 
without bias. 

We can rewrite $\hat{Z} = \sum_{i \in s_U} w_i \sum_{k \in V} (r_{ik}/r_k)\, y_k = \hat{Y}$. 

The second sum involves only $s_V$, by definition, and 
therefore $\hat{Y} = \sum_{k \in s_V} y_k \big( \sum_{i \in s_U} w_i r_{ik}/r_k \big) = \sum_{k \in s_V} y_k W_k$, 
where we have written for all $k \in s_V$: 

$$W_k = \sum_{i \in s_U} \frac{w_i\, r_{ik}}{r_k} \qquad (9)$$


We can work directly on the individuals sampled. In our 
case, $r_k$ is the number of links, that is, the number of 
services used by the person interviewed during the survey 
reference period. It is the quantity that is written $r_k(J)$ in 
the previous sections, the dependence on J serving as a 
reminder that the links affecting the weight can vary with 
the type of estimator (“average day”, “average week”) 
considered. This number is derived from the use data 
collected in the survey. 
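
The identity Z = Y on which the method rests can be checked numerically on a small universe. The sketch below is only an illustration: the function name `service_values`, the representation of the link matrix as a dictionary of sets, and the data are hypothetical.

```python
from fractions import Fraction

def service_values(links, y):
    """Compute z_i, the sum over k of (r_ik / r_k) * y_k, for every i.

    links: dict unit i of U -> set of units k of V linked to i
        (that is, the k with r_ik = 1).
    y: dict unit k of V -> integer value y_k.
    """
    r = {}
    for ks in links.values():
        for k in ks:
            r[k] = r.get(k, 0) + 1  # r_k: total number of links of k
    return {i: sum(Fraction(y[k], r[k]) for k in ks)
            for i, ks in links.items()}

# Six services, three persons; each service points to one person.
links = {1: {"a"}, 2: {"a"}, 3: {"a"}, 4: {"b"}, 5: {"c"}, 6: {"c"}}
y = {"a": 9, "b": 4, "c": 5}
z = service_values(links, y)
total_z = sum(z.values())  # total over the services
total_y = sum(y.values())  # total over the persons
```

Exact fractions are used so that the two totals can be compared without rounding error.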


APPENDIX 2: 
SUMMARY TABLE OF EXPRESSIONS 

$J$                all days in the survey reference period 
$\tau$             = card(J), number of days in the reference period 
$P(J)$             population of interest, all persons who used at 
                   least one service in the survey field during the 
                   reference period 
$N_J$              = card(P(J)), size of the population of interest 
$C$                all centres in the population centre, denoted by 
                   index c 
$\Pi_{c,j,t}$      all services provided in centre c on day j during 
                   time interval t, denoted by index i 
$\Pi_{\cdot,j,t}$  all services provided in the population centre on 
                   day j during time interval t 
$P_{c,j,t}$        all persons who visit centre c on day j during 
                   time interval t, denoted by index k 
$P_{\cdot,j,t}$    all persons who visit one of the centres in the 
                   population centre on day j during time interval t 
$P_j$              all persons who use services in the survey field 
                   on day j 
$y$                variable of interest 
$Y_J$              total of variable y in the reference population 
$\bar{Y}_J$        average of y in the reference population 
$\Pi(J)$           all services provided during the reference period 
                   in all centres in the survey frame 
$r_k(J)$           number of services provided to individual k during 
                   period J in all centres in the survey field, or 
                   “number of links” 
$s_\Pi$            sample of services 
$w_i$              weight associated with the services sample 
$s_P$              sample of individuals, all individuals who received 
                   sampled services 
$W_k$              weight associated with the sample of individuals 
$\Theta_j$         total of y in $P_j$ 
$N_j$              = card($P_j$) 
$\Theta$           total of y on “an average day” 
$N$                number of persons on “an average day” 
$\bar{y}$          = $\Theta / N$, average of y on “an average day” 
$r_k(j)$           number of services received by individual k on 
                   day j only 
$s_j$              sample of persons interviewed on day j 
$s_k(j)$           all services sampled on day j that point to 
                   individual k 
$s_k(J)$           all services sampled during period J that point to 
                   individual k 

In This Issue 


This issue of Survey Methodology contains papers on a variety of topics touching on coverage 
issues, nonresponse, imputation, survey designs, survey weighting and analysis of data from 
complex surveys. 

In the first paper of this issue, Blenk and Stasny develop a weighting adjustment in order to 
reduce the coverage bias in telephone surveys while controlling the increase in variance due to 
weighting. The weighting adjustment is applied to transient households, which are households 
moving in and out of the telephone population during the year. It is assumed that the transient 
telephone population is representative of the non-telephone population. The weighting adjustment 
proposed is based on propensity scores for transience obtained using a logistic regression model. 
The proposed method and several alternatives are compared using data collected from a survey of 
distressed and non-distressed regions of Kentucky, Ohio, and West Virginia. 

Mariano and Kadane use the information on the number of calls in a telephone survey as an 
indicator of how difficult an intended respondent is to reach. This permits a probabilistic division 
of the nonrespondents into those who will always refuse to respond and those who were not 
available to respond in a model of the nonresponse. It also permits an evaluation of whether the 
nonresponse is ignorable for inference about the dependent variable by incorporating the information 
on the number of calls into the model. These ideas are implemented on data from a survey in 
Metropolitan Toronto of attitudes toward smoking in the workplace. The results reveal that the 
nonresponse is not ignorable and those who do not respond are twice as likely to favor unrestricted 
smoking in the workplace as are those who do. 

In his paper, Hidiroglou unifies the nested and non-nested cases found in the double sampling 
theory. The nested case, also known as two-phase sampling, corresponds to the traditional case in 
which a first-phase sample is initially taken so that additional information may be collected. This is 
followed by a second-phase sample taken within the first one, which contains the variables of 
interest. The non-nested case reflects a situation in which both samples are selected independently 
from the same frame or possibly from different frames. Using the generalized difference, an 
estimator is proposed for both cases, and an optimal estimator that minimizes variance is developed. 
Variance estimation is also discussed for both cases. Numerous examples of surveys conducted at 
Statistics Canada illustrate the unification of both cases. 

Lavallée and Caron investigate the problem of producing estimates when using record linkage 
methods to link two populations together. In particular, they consider the problem of producing 
estimates for one of the populations using a sample from the other one, assuming the two 
populations have been linked together. The Generalized Weight Share method is adapted to take 
into account the linkage weights in three different ways: (1) all links where the linkage weight is 
non-zero; (2) all links where the linkage weights are greater than a given threshold; and (3) the links 
are randomly chosen. These proposed estimators are compared with the classical approach through 
a simulation study. 

Merkouris considers the problem of producing cross-sectional estimates with data collected from 
multiple panel surveys. Coverage of the cross-sectional population may be incomplete due to 
individuals leaving or entering the population after the selection of the panel. By recognizing that 
a repeating panel survey is a special type of multiple frame survey, Merkouris is able to propose 
weighting strategies suitable for various multiple panel surveys. These weighting procedures can 
be used to combine information from the multiple panels to produce cross-sectional estimates that 
take into account the dynamic character of the multiple panel design. 

Marker investigates survey design strategies to improve the quality of direct small area estimators, 
thus reducing the need for indirect, model-based estimators. Factors considered include stratification 
and oversampling, combining data from repeated surveys, harmonizing across different surveys, 
supplemental samples, and improved estimation procedures. 




Saigo, Shao and Sitter address the important problem of variance estimation under 
imputation for missing data. They propose a bootstrap method that works for both 
smooth and non-smooth statistics, even when the number of sampled clusters is small. 
This improves on their previously proposed bootstrap method, which could suffer from serious 
overestimation when the number of sampled clusters is small. In addition to the bootstrap method, 
Saigo, Shao and Sitter also propose a repeated Balanced Repeated Replication method that captures 
the imputation variance in the presence of random imputation. These methods are illustrated through 
a simulation study. 

Bellhouse and Stafford consider nonparametric local polynomial regression as an exploratory data 
analysis tool for data from complex surveys. They consider a single continuous regressor variable 
x, which is binned into a finite number of possible values, which may correspond to the precision 
of measurement of x, but may also be chosen otherwise. Point estimates of the local regression 
function, and associated variance estimates, are developed. The method is illustrated with an 
analysis of body mass indices from the Ontario Health Survey, and the nonparametric estimates are 
compared to those obtained from a parametric model. 

In the final paper of this issue, Silva and Smith use a state space approach for modelling of 
compositional time series using data from a repeated complex survey. A compositional time series 
is a multivariate time series of proportions constrained to add to one at each time point. They first 
transform the data using an additive logistic transformation, and then model the transformed series. 
Estimation methods based on the Kalman filter are developed and then applied to data from the 
Brazilian Labour Force Survey. The Kalman filter also provides model-based estimates of variance 
and confidence limits for the transformed series. Estimates of trends and seasonal effects are 
compared to those obtained using X-11 ARIMA, and found to be generally smoother since they 
explicitly account for sampling errors in the raw estimates of the series. 
Using Propensity Scores to Control Coverage Bias in Telephone Surveys 


KRISTIN BLENK DUNCAN and ELIZABETH A. STASNY¹ 


ABSTRACT 


Telephone surveys are a convenient and efficient method of data collection. Bias may be introduced into population 
estimates, however, by the exclusion of nontelephone households from these surveys. Data from the U.S. Federal 
Communications Commission (FCC) indicates that five and a half to six percent of American households are without phone 
service at any given time. The bias introduced can be significant since nontelephone households may differ from telephone 
households in ways that are not adequately handled by poststratification. Many households, called “transients”, move in 
and out of the telephone population during the year, sometimes due to economic reasons or relocation. The transient 
telephone population may be representative of the nontelephone population in general since its members have recently been 
in the nontelephone population. 


This paper develops a weighting adjustment for transients in an effort to reduce the bias due to noncoverage while 
controlling the increase in variance due to weighting. We use a logistic regression model to describe each household’s 
propensity for transience, using data collected from a survey of distressed and non-distressed regions of Kentucky, Ohio, 
and West Virginia. Weight adjustments are based on the propensity scores. Estimates of the reduction in bias and the error 
of estimates are computed for a number of survey statistics of interest, using the propensity based weight adjustments and 
several alternative weight adjustments. The error in adjusted estimates is compared to the error of the standard estimate to 
assess the effectiveness of the adjustment. 


KEY WORDS: RDD survey; Weight adjustments; Non-sampling error. 


1. INTRODUCTION 


The telephone is a standard mode of communication in 
today’s world, and hence it is extremely useful for 
conducting surveys. Telephone surveys have come into use 
more and more as a growing percentage of people have 
phone connections. Most people who belong to the 
population that a survey seeks to make inferences about, the 
survey’s target population, can be reached by phone. 
Therefore, the sample is drawn from the set of all people in 
households reachable through residential phone numbers. 
However, this sampling frame excludes all the people 
without telephone service who may compose a significant 
portion of some populations. It is currently estimated that in 
the United States, five and a half to six percent of house- 
holds are without telephone service at any given time 
(Belinfante 2000). People without phone service tend to be 
different from people with service, particularly with regard 
to economic factors (Smith 1990). Results of the survey will 
not truly reflect the entire population if these differences are 
significant on matters of importance to the survey. The 
coverage bias is particularly troublesome in surveys that 
examine subgroups of the population with lower telephone 
penetration rates. These groups include people in lower 
income households and people who have not obtained a 
high school degree. 

Poststratification on demographic variables associated 
with telephone coverage is helpful for reducing the cover- 
age bias, but it does not completely solve the problem 
(Massey and Botman 1988). Another way to account for 
this coverage bias is to let people who are currently without 
telephone service be represented by people in the survey 
who have not had continuous service recently. People 
whose phone status has changed within the last year are 
referred to as transients. Transients move in and out of the 
telephone population, possibly for economic reasons, or 
service interruptions during relocation. Transients who 
currently have phone service may be good representatives 
of the nontelephone population because they are included 
in the sampling frame, yet they have recently been part of 
the nontelephone population. 

A weighting adjustment suggested by Brick, Waksberg 
and Keeter (1996) uses transients in the sample to represent 
the nontelephone population. They use data from the U.S. 
Current Population Survey (CPS) to estimate unbiased 
weighting class adjustments for the transient respondents in 
their survey. Frankel, Ezzati-Rice, Wright and Srinath 
(1998) also employ this weighting class adjustment, and 
consider two similar adjustments. Brick, Flores Cervantes, 
Wang and Hankins (1999) and Frankel, Srinath, Battaglia, 
Hoaglin, Wright and Smith (1999) evaluate these adjust- 
ments using surveys that ask questions about telephone 
service, but that are not subject to telephone coverage bias. 
These studies found that employing weight adjustments 
based on transient status generally led to improved 
estimates. 

This article studies an alternative method for computing 
a transient weight adjustment. Our method develops a 
model for predicting transience using demographic 
variables. The weight adjustment is then based on the 
respondent’s propensity for transience. We also compare 
our propensity method to the method suggested by Brick 
et al. (1996), and to a response probability method where 
the weight adjustment is based on the length of interruption 
in telephone service. 

¹ Kristin Blenk Duncan and Elizabeth A. Stasny, Department of Statistics, Ohio State University, Columbus, OH 43210-1247. 

We use data from the Appalachian Poll, an RDD tele- 
phone survey conducted by the Ohio State University’s 
Center for Survey Research during June and July of 1999. 
The survey was sponsored by The Columbus Dispatch, and 
compared distressed and non-distressed regions of 
Kentucky, Ohio, and West Virginia. The study gathered 
information on quality of life issues and perceptions about 
the Appalachian regions, and also posed a series of standard 
demographic questions. A stratified sample was used, and 
just over 400 surveys were completed from each of the six 
strata (Appalachian and non-Appalachian regions of Ohio, 
Kentucky, and West Virginia). The poll targeted English 
speaking adults, 18 years of age or older, residing in the 
three states. Coverage bias is of particular concern in this 
survey since telephone coverage rates are lower than usual 
in the distressed Appalachian regions. 

In section 2, we report on the literature describing tele- 
phone and transient populations. In this section we also 
explore differences between these groups in our data, 
illustrating the concern about coverage bias. Section 2 ends 
with our proposed model for predicting transience. Section 
3 details the various weighting procedures. In section 4 we 
discuss the trade-off between bias reduction and increased 
variance from adjusted weights, and compare the weighting 
schemes. The final section summarizes the findings. 


2. NONTELEPHONE AND TRANSIENT 
TELEPHONE POPULATIONS 


The target population for a telephone survey can be 
categorized by telephone status into four groups: contin- 
uous service households, transient households which are 
currently with service, transient households which are 
currently without service, and chronic nontelephone house- 
holds. We need to know something about the size of each 
of these groups in order to account for coverage bias in the 
survey. Data from the FCC is useful for examining long 
term trends in the size of the nontelephone population. Not 
as much is known, however, about the short-term changes 
in phone coverage. 

Keeter (1995) used panel surveys to study the dynamics 
of the transient phone population. In the March 1992 and 
1993 CPS, it was found that 94.1% of households in the 
sample at both times had a phone at both time points, 2.6% 
at neither point, and 3.4% had a phone at one interview, but 
not the other. Fifty-seven percent of respondents who 
reported having no phone at one or both interviews were transient. 
If the measurements could be taken continuously, rather 
than at two points in time, even more households would be 
labeled transient. Keeter concludes that, “a sizable minority 
of nontelephone households, at the least, have recently been 
in the telephone population or are soon to join it. Such 
transient households constitute a measurable segment of 
telephone households and thus can provide data to 
characterize the nontelephone population,” (Keeter 1995, 
page 201). The same article asserts that, “Transient tele- 
phone households are much more like nonphone house- 
holds than those with continuous service,” (Keeter 1995, 
page 209). This conclusion is based on formal tests using 
demographic variables from the CPS. Data from the 
National Survey of America’s Families presented in Brick 
et al. (1999) supports Keeter’s findings. Since transients 
make up a nontrivial proportion of the nontelephone popu- 
lation and transients are more similar to the nontelephone 
households than they are to continuous service households, 
it is reasonable to use data from the transients in the sample 
to attempt to reduce coverage bias. 

In the Appalachian Poll, 140 of the 2,463 respondents, 
or 5.7%, replied positively to the question, “During the last 
twelve months has your household ever been without 
telephone service for one week or more?” These respon- 
dents are categorized as transients. In the Appalachian 
regions, the transience rate is 7.4% while the rate is only 
3.9% in non-Appalachian regions. 

Table 1 compares transient and nontransient households 
from the sample with regard to selected variables. The large 
differences between the two populations illustrate the need 
for bias reduction. People who live in transient households 
are much younger, have lower incomes, and are less 
likely to be employed full time. They also have less access 
to health insurance and computers. 


Table 1 
Selected Characteristics of Nontransient and Transient Households 

Characteristic                      Nontransient   Transient 
Median age                          47.0           [illegible] 
Household income less than $20K     27.8%          60.0% 
Employed full-time or retired       55.0%          34.5% 
No health insurance                 12.7%          30.0% 
Owns or is buying residence         79.4%          61.4% 
Computer in home                    47.4%          26.4% 
Not enough money for food           12.3%          42.9% 

Note: Statistics are based on unweighted frequencies in the sample 
which oversampled from the Appalachian regions, and thus are not 
representative of population quantities. 


A model for transience. Using the Appalachian Poll 
sample, we develop a logistic regression model to predict 
transience with demographic variables. The independent 
variables used to predict transience are age, employment 
status, race, income, and region. The model is described in 
the Appendix. Education and tenure are also good pre- 
dictors of transience, but they are strongly correlated with 
the other variables in our model, and thus we chose not to 
include them. For a comparison of models that predict tele- 
phone coverage, see Smith (1990). We will use our model 
in the propensity weighting adjustment described in the 
following section. 


3. WEIGHT ADJUSTMENTS 


We consider several weighting schemes that attempt to 
account for the coverage bias inherent in telephone surveys. 
Each of these schemes is compared to the actual weighting 
procedure used for the Appalachian Poll. In the standard 
procedure, a base weight was calculated for each respondent. 
This base weight is (# adults in household) / (# voice 
telephone lines), which is proportional to the inverse of the 
respondent’s probability of being in the sample. The weights 
were then raked in each of the six strata to agree with 1990 
Census proportions for age group, education level, and gender. 
Finally, the weights were scaled to the sample sizes within the six strata. 
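
The base weight and the final scaling step can be illustrated with a minimal Python sketch. The function names and the four-household stratum are hypothetical, and the raking step that sits between the two is left out here.

```python
def base_weight(n_adults, n_voice_lines):
    """Base weight: adults in the household divided by voice telephone lines."""
    if n_adults < 1 or n_voice_lines < 1:
        raise ValueError("need at least one adult and one voice line")
    return n_adults / n_voice_lines


def scale_to_sample_size(weights):
    """Rescale the weights of one stratum so they sum to its number of respondents."""
    total = sum(weights)
    return [w * len(weights) / total for w in weights]


# Hypothetical stratum of four respondents: (adults, voice lines).
households = [(1, 1), (2, 1), (3, 1), (2, 2)]
weights = [base_weight(adults, lines) for adults, lines in households]
scaled = scale_to_sample_size(weights)
```

A respondent from a three-adult household with one line gets three times the base weight of a single adult with one line, and the scaled weights sum to the stratum sample size.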


3.1 Length of Disconnect 


Respondents to the Appalachian Poll who replied “yes” 
to the question about an interruption in phone service of one 
week or longer were then asked how many days they were 
without service in the last year. A simple approach to the 
coverage bias problem is to give transients a weight adjust- 
ment inversely proportional to the fraction of the year that 
they were with service. For example, a person who has only 
had service for six months out of the last twelve receives a 
weight of two, thus representing himself and one other 
person in the population with a six-month disconnect who 
is currently without service. 

This naive approach is included in the analysis for 
comparison with other schemes. It is referred to as the day 
scheme (DAY). Weight adjustments are calculated as 
365/(365 - # days without service). This weight adjustment 
is applied after the base weight described above, and before 
the weights are raked. 
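
As a small illustration of the DAY adjustment, the following Python sketch evaluates 365/(365 - # days without service). The function name and the example day counts are hypothetical.

```python
def day_adjustment(days_without_service, days_in_year=365):
    """DAY scheme factor: year length over the number of days with service."""
    if not 0 <= days_without_service < days_in_year:
        raise ValueError("days without service must be in [0, days_in_year)")
    return days_in_year / (days_in_year - days_without_service)


one_week = day_adjustment(7)        # short interruption, factor close to one
half_year = day_adjustment(182.5)   # half the year without service
ten_months = day_adjustment(300)    # long interruption, very large factor
```

Half a year without service gives a factor of two, as in the six-month example in the text, while an interruption of 300 days gives a factor above five, which is the kind of extreme weight that inflates variance.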

While this approach is logical, it is not practical for 
controlling variance. It is usually considered undesirable to 
use weighting factors larger than three. In fact, for many 
large surveys conducted by the U.S. Census Bureau, if 
weighting factors are larger than two, respondents are 
merged into larger groups and a group weight is calculated 
in order to obtain lower weighting-adjustment factors; see, 
for example, CPS (1978). 

This simple approach becomes more practical when 
respondents are grouped by the length of their interruption 
in service. In a scheme called day group (DAYG), tran- 
sients are grouped into quartiles across the entire sample by 
length of interruption in phone service. These quartiles 
correspond to interruptions of one week, more than one 
week but less than three weeks, three weeks to two months, 
and more than two months. The weight adjustment for each 
group is 365/(365 - avg. # days without service), and it is 
also applied after the base weight, prior to raking. This 
grouping procedure is helpful for reducing the variance 
caused by extremely long interruptions. 
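
The grouping can be sketched in Python as follows. The day cut-offs (7, 21 and 60 days) are an assumed translation of the four groups described above, and the interruption lengths are made up for illustration.

```python
from statistics import mean


def day_group(days):
    """Assign an interruption length to one of four groups (assumed cut-offs)."""
    if days <= 7:
        return 0          # one week
    if days < 21:
        return 1          # more than one week, less than three weeks
    if days <= 60:
        return 2          # three weeks to about two months
    return 3              # more than two months


def dayg_adjustments(days_list, days_in_year=365):
    """DAYG factor per respondent, based on the group's average days without service."""
    groups = {}
    for d in days_list:
        groups.setdefault(day_group(d), []).append(d)
    factor = {g: days_in_year / (days_in_year - mean(ds)) for g, ds in groups.items()}
    return [factor[day_group(d)] for d in days_list]


days = [7, 7, 10, 14, 30, 45, 90, 300]   # hypothetical transients
adjustments = dayg_adjustments(days)
```

In this made-up sample the respondent with a 300-day interruption shares a factor of 365/170 with the 90-day respondent, instead of receiving 365/65 under the ungrouped DAY scheme.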


3.2 Weighting Class Adjustment Scheme 


Brick et al. (1996) also implement a response probability 
adjustment to reduce coverage bias. Under their procedure, 
they partition the target population into the four components 
described in section 2: $t_1$ is the number of persons 
living in continuous service households; $t_2$ is the number of 
persons living in transient households that currently have 
service; $t_3$ is the number of persons living in nontelephone 
households that have not had any service in the last year; 
and $t_4$ is the number of persons living in transient households 
that are currently without service. The response 
probability model the authors use assumes that $t_3 = 0$. With 
this assumption, an unbiased weight adjustment is 
$A = (t_2 + t_4)/t_2 = 1 + (t_4/t_2)$, the inverse of the proportion of 
the transient population that currently has service. Unfortunately, 
these population quantities are unknown and must 
be estimated. Following the lead of Brick et al., we use CPS 
data to estimate $t_1 + t_2$, the number of persons who currently 
have service, and $t_4$; call these estimates $\tilde{t}_{1+2}$ and $\tilde{t}_4$, 
respectively. From the Appalachian Poll, separate estimates 
of $t_1$ and $t_2$ are available; designate these estimates as $\hat{t}_1$ 
and $\hat{t}_2$, respectively. Since the estimates come from different 
surveys, ratios are used in the weight adjustment, and $A$ 
is estimated by 


$$\hat{A} = 1 + \frac{\tilde{t}_4 / \tilde{t}_{1+2}}{\hat{t}_2 / (\hat{t}_1 + \hat{t}_2)}. \qquad (1)$$
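
A minimal Python sketch of a ratio-based estimate of this kind is given below. It assumes the adjustment has the form one plus the ratio of the nontelephone count to the count of transients with service, with the first ratio taken from CPS totals and the second from poll totals. The function name, argument names and counts are hypothetical.

```python
def weighting_class_adjustment(cps_nontel, cps_tel, poll_transient, poll_continuous):
    """Estimate A = 1 + (nontelephone / transient-with-service) from two surveys.

    cps_nontel / cps_tel is the CPS ratio of persons without service to
    persons with service; the poll supplies the transient share of the
    telephone population. All inputs are weighted person counts for one cell.
    """
    if cps_tel <= 0 or poll_transient <= 0:
        raise ValueError("cell needs telephone persons and transients")
    nontel_ratio = cps_nontel / cps_tel
    transient_share = poll_transient / (poll_transient + poll_continuous)
    return 1.0 + nontel_ratio / transient_share


# Hypothetical cell: CPS finds 50 persons without service per 1000 with service,
# and 10 percent of the poll's telephone respondents are transients.
a_hat = weighting_class_adjustment(50, 1000, 100, 900)
```

With these made-up counts the estimate is 1.5, so each transient in the cell would stand for themselves and half of a person currently without service.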


Some persons are more likely to live in nontelephone 
households than others, so Brick et al. classified transients 
into cells based on characteristics associated with not 
having a telephone, and computed the weight adjustment 
for each cell. Four classification schemes, which catego- 
rized respondents by either education or tenure, length of 
interruption, and race/ethnicity were considered. 

Brick et al. found schemes that classified respondents as 
transients if they had an interruption of one week or more 
to be superior to schemes that used a cut-off of one month, 
so for the Appalachian Poll data we use the one-week cut- 
off. Due to the small number of Hispanics in the 
Appalachian Poll sample, we do not categorize by ethnicity. 
Thus, for our analyses, the cell classifications for two 
schemes that use the method described by Brick et al. 
(1996) are defined as follows: 


BWKE: households that had a service interruption 
of one week or more within categories defined by 
education (less than high school, high school 
diploma, college diploma or above) and race (black, 
non-black); and 

BWKT: households that had a service interruption 
of one week or more within categories defined by 
tenure (own/other, rent) and race. 


The disadvantage of using these schemes in our study is 
that the estimates needed from the CPS are available by 
state, but not by region since the CPS does not sample from 
all counties. Persons in Appalachian regions are less likely 
to have telephones, but we cannot account for this with the 
available CPS data. Even when we consider statewide data, 
the sample size of the CPS is not large enough to get 
reliable estimates of the nontelephone population in all of the cells. For example, in 1999 
the CPS did not sample any blacks with a college degree or 
higher who live in Kentucky and do not have telephone 
service. Thus, the weighting cell adjustments computed for 
use with the Appalachian Poll are based on CPS data from 
the three states combined. 


3.3 Raking Ratio Adjustment 


Lohr (1999) explains the use of raking ratio estimates to 
adjust for nonresponse in surveys. We propose a similar use 
of raking to account for coverage bias. We estimate the 
proportion of the population with continuous telephone 
service, and then use raking to allow transients in the 
sample to represent the portion of the population without 
continuous telephone service. 
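
Raking itself can be sketched as iterative proportional fitting. The Python example below is a generic illustration with hypothetical data and target totals, not the procedure or software used for the Appalachian Poll.

```python
def rake(weights, categories, targets, max_iter=100, tol=1e-10):
    """Iterative proportional fitting of weights to marginal target totals.

    categories maps a variable name to one label per respondent;
    targets maps the same variable name to {label: target weighted total}.
    """
    w = list(weights)
    for _ in range(max_iter):
        largest_change = 0.0
        for var, labels in categories.items():
            # Current weighted total in each category of this variable.
            totals = {}
            for wi, label in zip(w, labels):
                totals[label] = totals.get(label, 0.0) + wi
            # Scale every weight so the category totals hit the targets.
            for i, label in enumerate(labels):
                factor = targets[var][label] / totals[label]
                w[i] *= factor
                largest_change = max(largest_change, abs(factor - 1.0))
        if largest_change < tol:
            break
    return w


# Hypothetical sample of ten respondents, two of them transients.
gender = ["m"] * 5 + ["f"] * 5
transient = ["yes", "no", "no", "no", "no"] * 2
categories = {"gender": gender, "transient": transient}
targets = {"gender": {"m": 4.8, "f": 5.2}, "transient": {"yes": 3.0, "no": 7.0}}
raked = rake([1.0] * 10, categories, targets)
```

Here the two transients end up carrying a combined weight of three, so they also stand in for the part of the population without continuous service, while the gender totals still match their targets.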

The percent of households without continuous service is 
estimated by 


$$1 - \frac{\hat{t}_1 + \hat{t}_2}{\hat{t}_1 + \hat{t}_2 + \hat{t}_4} \times \frac{\hat{t}_1}{\hat{t}_1 + \hat{t}_2}, \qquad (2)$$


where $\hat{t}_i$, $i = 1, 2, 4$, is obtained from the FCC data. The 
first fraction estimates the proportion of households that 
currently have service, and the second fraction estimates the 
proportion of nontransient households among households 
with service. Again, we assume that $t_3 = 0$. The FCC gives 
telephone penetration rates by state, but not by region. Data 
from the 1990 Census does give penetration rates by 
county, but rates changed from 1990 to 1999. Therefore, to 
estimate the 1999 regional penetration rate, we maintained 
a constant ratio of percent of households without a phone in 
the non-Appalachian regions to percent of households 
without a phone in the Appalachian regions and adjusted 
the 1990 Census regional rates to match the 1999 state 
rates. Table 2 gives the data we used to compute the 1999 
state rates, and the resulting estimates. 
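
One way to read this adjustment is sketched below in Python: hold the 1990 ratio of non-Appalachian to Appalachian noncoverage fixed, and choose the regional rates whose population-weighted average equals the 1999 state rate. The weighted-average step is an assumption of this sketch, and the function name is hypothetical. The inputs are the Kentucky figures from Table 2.

```python
def regional_no_phone_rates(state_rate_1999, rate_1990_ap, rate_1990_nonap, pop_share_ap):
    """Return (ratio, Appalachian rate, non-Appalachian rate) for 1999.

    Rates are percents; pop_share_ap is the Appalachian share of the
    state population as a fraction.
    """
    ratio = rate_1990_nonap / rate_1990_ap
    pop_share_nonap = 1.0 - pop_share_ap
    # state = share_ap * r_ap + share_nonap * (ratio * r_ap), solved for r_ap
    rate_ap = state_rate_1999 / (pop_share_ap + pop_share_nonap * ratio)
    return ratio, rate_ap, ratio * rate_ap


# Kentucky: 1999 state rate 6.7, 1990 regional rates 19.1 and 8.2,
# and 18.6 percent of the state population in the Appalachian region.
ratio, rate_ap, rate_nonap = regional_no_phone_rates(6.7, 19.1, 8.2, 0.186)
```

Under this reading the Kentucky ratio is about 0.429 and the estimated 1999 rates are about 12.5 and 5.4 percent for the Appalachian and non-Appalachian regions.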

In a scheme referred to as transient raking, or TRAK, 
transient status is included as a control variable for raking 
along with age, gender, and education level. The totals we 
used for raking by transient status are given in Table 2. 
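
The raking totals for transient status can be sketched as follows. The sketch assumes that the share of households with service comes from the estimated regional noncoverage rate, and that the nontransient share among households with service comes from the poll's unweighted transient count. The function names are hypothetical, and the inputs are the non-Appalachian Kentucky figures from Table 2.

```python
def pct_without_continuous_service(pct_no_phone, n_transients, n_sample):
    """Percent of the population without continuous telephone service."""
    share_with_service = 1.0 - pct_no_phone / 100.0
    share_nontransient = 1.0 - n_transients / n_sample
    return 100.0 * (1.0 - share_with_service * share_nontransient)


def desired_transients(pct_without, n_sample):
    """Raking total for transients, on the scale of the stratum sample size."""
    return round(pct_without / 100.0 * n_sample)


# Non-Appalachian Kentucky: 5.4 percent without a phone, 19 transients among 407.
pct = pct_without_continuous_service(5.4, 19, 407)
target = desired_transients(pct, 407)
```

With these inputs the percent without continuous service is about 9.8 and the desired number of transients is 40, against 19 transients actually observed in that stratum.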


3.4 A New Propensity Weighting 


An estimated propensity score is sometimes used to 
create a weight adjustment to account for nonresponse in 
surveys where some variables are known for the non- 
respondents. For example, in a face-to-face household inter- 
view the interviewer knows the address of the non- 
respondent and may have information about the person’s 
race, gender, and age. A logistic regression model that 
describes propensity for response is developed, and 
respondents are assigned a weight of $1/\hat{p}$, where $\hat{p}$ is the 
estimated propensity to respond (Little and Rubin 1987). 
This procedure gives higher weights to sampled households 
that are more similar to the nonrespondents. Since there is 
typically no data on the excluded nontelephone population 
in telephone surveys, a modified approach is taken to using 
a propensity score. We only adjust the weights for the 
transients since they will represent the missing part of the 
sample: weights for nontransients remain unadjusted. The 
weight adjustment for transients is $1/(1 - \hat{p})$, where $\hat{p}$, the 
estimated propensity for transience, is described by the 
model in section 2. Households with a higher estimated 
propensity for transience may be more representative of the 
nontelephone population and they receive higher weight 
adjustments. This adjustment is applied to the base weight, 
and the scheme is called propensity (PROP). 


Table 2 
Computation of Transient Status Raking Totals 

                                             Kentucky          Ohio              West Virginia 
                                             Ap      Non-Ap    Ap      Non-Ap    Ap      Non-Ap 
Appalachian Poll Data 
  Sample size                                412     407       413     405       411     415 
  # transients in sample                     38      19        18      13        36      16 
  Percent of sample without cont. service    9.2     4.7       4.4     3.2       8.8     3.9 

Census and FCC Data 
  1990 State % no phone                      10.2    10.2      4.7     4.7       10.3    10.3 
  1990 Region % no phone                     19.1    8.2       11.7    4.5       14.3    8.4 
  1999 State % no phone                      6.7     6.7       5.2     5.2       7.3     7.3 
  Percent of state pop. living in region     18.6    81.4      2.6     97.4      31.8    68.2 

Estimates 
  Ratio of Non-Ap to Ap noncoverage          0.429   0.429     0.385   0.385     0.587   0.587 
  Estimated 1999 region % no phone           12.5    5.4       13.0    5.0       10.1    6.0 
  Estimated % of pop. without cont. service  20.6    9.8       16.7    8.1       18.0    9.6 
  Desired # of transients in sample          85      40        69      33        74      40 


Transience is not that common, and most estimated 
propensity scores are fairly low. In the PROP scheme, the 
average weight adjustment for a transient household is 
1.167. This adjustment is not large enough for transients to 
represent themselves and the entire nontelephone popu- 
lation. That is, when the weights are scaled to sum to the 
population size, the sum of the final weights for transients 
is less than the size of the transient population. To account 
for this under-representation, the propensity weight adjust- 
ment is applied, and then transient is used as a control 
variable for raking along with age, education, and gender. 
The estimated population sizes for transients are computed 
as in section 3.3. This weighting scheme is called 
augmented propensity, or AUGP. 
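
The PROP adjustment can be illustrated with a short Python sketch. The logistic link is the standard one, but the linear predictor is a placeholder: the fitted coefficients are in the Appendix and are not reproduced here. Function names are hypothetical.

```python
import math


def propensity(linear_predictor):
    """Logistic transform of a linear predictor into a propensity in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def prop_adjustment(p_hat, is_transient):
    """PROP factor: 1 / (1 - p_hat) for transients, 1 for everyone else."""
    if not 0.0 <= p_hat < 1.0:
        raise ValueError("propensity must be in [0, 1)")
    return 1.0 / (1.0 - p_hat) if is_transient else 1.0


# A transient with an estimated propensity of 1/7 (about 0.143).
factor = prop_adjustment(1 / 7, is_transient=True)
```

A propensity of about 0.143 gives a factor of about 1.167, the size of the average adjustment reported for transient households, which shows how modest the PROP factors are compared with the DAY factors.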


4. FINDINGS 


The analysis and comparison of the adjustment schemes 
presented here parallels the analysis performed by Brick 
et al. (1996). We first discuss the change in variance 
resulting from adjusting the weights to reduce coverage bias 
and present a statistic for measuring the relative variability. 
Then, the schemes are evaluated by comparing the variance 
of adjusted estimates to the mean squared error of the 
standard estimate. 


4.1 Changes in Variability 


The goal of the adjustment schemes is to decrease 
coverage bias while controlling variance. Adjustment of the 
weights to reduce the bias increases the variability of the 
weights, hence increasing the variance of the estimates. 
Kish (1992) gives a formula for measuring the increase in 
variance due to unequal weights. Brick et al. (1996) refer to 
this expression as the variance inflation factor (VIF). The 
VIF can be written as 


$$\mathrm{VIF} = 1 + [\mathrm{CV}(\text{weights})]^2, \qquad (3)$$


where CV(weights) is the coefficient of variation of the 
weights. A VIF ratio is computed to compare the VIF of a 
new weighting scheme to that of the standard weighting 
scheme. Table 3 gives VIF ratios for the six strata in the 
Appalachian Poll data under each scheme described in 
section 3. A VIF ratio of 1.12, for example, indicates an 
increase in variance of 12 percent over the variance using 
the standard weighting scheme. The VIF ratio values are 
reasonable for all schemes except the DAY scheme, whose 
average VIF ratio of 3.055 means the variance roughly triples 
(an increase of about 200 percent). The VIF 
ratio values for our PROP scheme are all very close to one, 
suggesting that the PROP weight adjustments will not 
appreciably increase the variance of our estimates. 
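
A minimal Python sketch of the VIF and the VIF ratio follows. It assumes the coefficient of variation is computed with the population (divide by n) variance, in which case 1 + CV² equals n Σw² / (Σw)². The text does not say which form was used, and the weights below are made up.

```python
def vif(weights):
    """Variance inflation factor, 1 + CV(weights)^2, with the population variance."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


def vif_ratio(adjusted_weights, standard_weights):
    """VIF of an adjusted weighting scheme relative to the standard scheme."""
    return vif(adjusted_weights) / vif(standard_weights)


standard = [1.0, 1.0, 1.0, 1.0]   # hypothetical equal weights
adjusted = [1.0, 1.0, 2.0, 2.0]   # hypothetical weights after an adjustment
ratio = vif_ratio(adjusted, standard)
```

Equal weights give a VIF of exactly one, and doubling half of the weights gives a ratio of 10/9, an increase in variance of about 11 percent.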


4.2 Coverage Bias Reduction 


Estimates of seventeen population proportions using 
survey variables from the Appalachian Poll were calculated 
for the standard weighting procedure and for each of the 
seven adjustment schemes (see Table 4 for a list of the 
seventeen variables). WesVar software was used to 
calculate standard errors for these estimates by means of 
replication. We would like to assess the effectiveness of 
each scheme for reducing the coverage bias on these 
seventeen characteristics. Estimates from an independent 
source that are free of telephone coverage bias would be 
ideal for such an assessment. Unfortunately, such bench- 
marks are unavailable and some model assumptions are 
necessary in order to perform an evaluation. We assume 
that the weight adjustment procedures reduce the coverage 
bias. Thus the difference between the standard estimate and 
the adjusted estimate is considered to be an unbiased 
estimate of the decrease in coverage bias resulting from the 
adjustment. The assumption favors the adjusted estimates, 
considering them to be unbiased. 


Table 3 
Ratios of Variance Inflation Factor Due to Weight Adjustment 

                                 Ratio of scheme’s VIF to standard weight’s VIF 
Region                           DAY     DAYG    BWKE    BWKT    TRAK          PROP    AUGP 
Non-Appalachian Ohio             0.999   0.997   1.004   1.023   1.063         0.999   1.061 
Appalachian Ohio                 1.480   1.016   1.039   1.091   1.331         0.999   1.336 
Non-Appalachian Kentucky         4.151   1.040   1.018   1.054   1.030         0.999   1.029 
Appalachian Kentucky             2.433   1.069   1.045   1.042   [illegible]   1.003   1.145 
Non-Appalachian West Virginia    6.331   1.027   1.010   1.029   1.020         0.999   1.024 
Appalachian West Virginia        2.935   1.085   1.058   1.053   1.116         1.005   1.119 
Scheme Average                   3.055   1.039   1.029   1.049   1.115         1.001   1.119 

Table 4 
Estimated Reduction in Bias and Bias Ratio for Selected Characteristics 


Standard estimate 


Characteristic Estimate St. 
error 

Owns Home
Non-Appalachian Ohio exes 3 0.6 0.5 0.5 1.2
Appalachian Ohio 7154 2.8 4.4 0.6 0.6 a
Non-Appalachian Kentucky 68.6 3.1 7.2 0.8 0.9 1.8
Appalachian Kentucky SO. 29 0.8 0.3 1.3
Non-Appalachian West Virginia 80.0 2.3 14.2 1.6 0.9 1.9
Appalachian West Virginia SLOT 2.2 8.2 0.7 -0.4 0.5

No Health Insurance
Non-Appalachian Ohio 7.3 1.7 0.0 -0.1 -0.6 -1.4
Appalachian Ohio 1262 0.9 0.1 0.3 0.3
Non-Appalachian Kentucky 8.8 1.8 1.8 0.4 0.2 0.3
Appalachian Kentucky DD ean 3.4 0.1 -0.1 -0.2
Non-Appalachian West Virginia 14.2 2.1 -4.8 -0.5 -0.7 -1.0
Appalachian West Virginia 24.6 2.5 2.5 -0.8 -1.7 -1.3

Not enough Money for Food
Non-Appalachian Ohio 10.8 1.9 -0.7 -0.6 -0.9 -1.6
Appalachian Ohio NS PAs PAS) -4.7 -0.8 -0.6 -1.3
Non-Appalachian Kentucky 11.4 2.4 -3.3 -0.8 -1.3 -1.7
Appalachian Kentucky 20.2 2.4 -7.4 -2.3 -2.1 -2.1
Non-Appalachian West Virginia 14.0 2.1 4.3 -0.1 -1.0 -1.4
Appalachian West Virginia 16.4 2.0 15) -0.7 -1.0 -0.9

Computer in Home
Non-Appalachian Ohio 60.1 3.0 0.4 0.3 0.6 1.2
Appalachian Ohio 40.0 3.0 1.2 0.2 0.3 0.8
Non-Appalachian Kentucky 44.5 3.0 6.7 0.9 0.8 1.1
Appalachian Kentucky DB) ie 26: 19 1.0 0.9 1.1
Non-Appalachian West Virginia 46.2 2.6 7.6 0.6 Joi 1.2
Appalachian West Virginia BOnl, 4.3 1.0 0.3 0.4

Summary of Seventeen Variables
                        Estimated reduction in bias
                        DAY    DAYG   BWKE   BWKT
Mean absolute value     0.032  0.005  0.006  0.009
Median absolute value   0.022  0.005  0.006  0.011


Estimated reduction in bias 
DAY DAYG BWKE BWKT TRAK PROP AUGP 


Bias Ratio 
DAY DAYG BWKE BWKT TRAK PROP AUGP 


14° 01 16 O28 02 02. 204.0) 20508 OO 20S 
32 Os eS 16” 02 OQ Te OR eri eo rs 
15 penODeNe ors 2 3Mme) OS 03: TEKCIE Bers CMON tg :5 
03 iu 00a R03 hou aa03 Ol 06 ee Ot a BOOuiia Oil 
146 O24 61 07 04 08.) 0.6. "Ol uo 06 
0.3 WiOLOy0:2 SOBINs 02) Mako ayy onmen Copia 
qt Oly eke VOn Ol O40 0.8) enon reo tae cd 
0:5 oar rre.s o4” ou OT Port hg 3H TOO O3 
0:0 VO leoul PO mo? Ow) VOR aolmntOOR F100 
0:8) 7-04 CUS EA) 00 OOF Od pe | 0:31) 0s OG 
Beenie ita 5: 2s 02 8 Ody 005 1 OG EEO Niet 07 
27 WDE O30 {0 OSs Lory) Oise geoae 
22 RST 3ee2d Coa To.) 205 (ROSIN Ron OKI 
233 WicOD\ tad KO). SOS woe 0Sy ence Komp Aes 
D6 BpkO4 eects alld: p03 GA0:5 he cO:ti pe Ol nO eens 
3.8, aa ae By CULO Ree eee Sigig= ogee eG 
1:7 poet Pate Dip [Or iro Se \0:7e be-0.8 BY 2a Sto9 
BD 05, 2G Eek epee eS Ie fee 
13 Sora Olin’ (0.1 0.261 0:4 CSN OLORN 05 
18 sete badi0 04 01 0:1) owhi0s3 up 00.6, dake 0-0 OF 
09 <0 2 a0 22. 03 03) WO tl OF ea Ola EOS 
23 OO, 8 08 04 C4 Oss. kOe e000 Og 
15 aa Uie 2107 © G02 04 OAL LOG TOF 9 06 
G2 POs 4.05 16 = 0.4 Op | p02 01h EOD 
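Summary of Seventeen Variables (remaining columns)
                        Estimated reduction in bias   Bias ratio
                        TRAK   PROP   AUGP            DAY    DAYG   BWKE   BWKT   TRAK   PROP   AUGP
Mean absolute value     0.013  0.002  0.014           1.396  0.235  0.620  0.412  0.885  0.075  0.885
Median absolute value   0.014  0.001  0.014           0.995  0.240  0.245  0.420  0.605  0.055  0.665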


Note: In addition to the four proportions listed in the table, the summary of seventeen variables includes worry about income, better off 
economically in the 1990’s, dissatisfied with own net worth, married, have children, unemployed, college graduate, in good or excellent health, 
serious illness in household, no family doctor, satisfied with own housing, very safe drinking water, and internet access in home. 


Using our assumption, we compare the estimate from 
each scheme to the standard estimate. The reduction in 
coverage bias is estimated by the difference between the 
standard estimate and the adjusted estimate. There are seven 
different estimates of the bias reduction, one for each 
scheme. The estimated reduction in bias is given by 

$$ b_i = \hat{p}_s - \hat{p}_i \qquad (4) $$

where $b_i$ is the estimated bias reduction using scheme $i$, $\hat{p}_s$
is the standard estimate, and $\hat{p}_i$ is the estimate from adjustment
scheme $i$. Estimated reductions in bias for four
characteristics by the six strata are given in Table 4 for each
scheme. For the characteristics owns home, not enough 
money for food, and computer in home, the direction of the 
bias is fairly consistent across schemes and regions. 
Reassuringly, the bias is in the expected direction for these 
characteristics, with fewer people owning homes, more 
people not having enough money for food, and fewer 
people having computers in their homes, than is indicated 
by the estimates using the standard weighting scheme. For 
health insurance, the direction of the bias is mostly 
consistent across regions. The standard estimate is biased 
upward for Appalachian Ohio and non-Appalachian 
Kentucky, and generally biased downward in the other
regions. 

The absolute size of the reduction in bias by itself is not 
fully meaningful, because it does not account for the 
amount of sampling error associated with the estimate. 
Therefore, we also calculate the bias ratio, as in Brick et al.
(1996). The bias ratio for scheme $i$, $r_i$, is given by

$$ r_i = \frac{b_i}{\mathrm{se}(\hat{p}_s)} \qquad (5) $$

where $\mathrm{se}(\hat{p}_s)$ is the standard error of the standard estimate.
Table 4 also gives the bias ratio for the selected estimates. 
DAY, TRAK, and AUGP give the largest bias ratios; for 
these adjustment schemes the bias is not negligible when 
we consider the standard error. DAYG and PROP have low 
bias ratios, indicating that the bias reduction is small 
compared to the error of the estimate. 
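
As a worked illustration of the bias reduction in equation (4) and the bias ratio defined above, the following sketch evaluates both quantities for one cell of Table 4 (owns home, non-Appalachian Kentucky, DAY scheme). The function names are ours, and the adjusted estimate of 61.4 is an assumption: it is back-calculated from the tabled standard estimate (68.6) and reduction (7.2) rather than taken from the source.

```python
def bias_reduction(p_standard: float, p_adjusted: float) -> float:
    """Estimated bias reduction: standard estimate minus adjusted estimate."""
    return p_standard - p_adjusted


def bias_ratio(b: float, se_standard: float) -> float:
    """Bias reduction relative to the standard error of the standard estimate."""
    return b / se_standard


# Owns home, non-Appalachian Kentucky (percent); estimate and SE from Table 4.
p_s, se_s = 68.6, 3.1
p_day = 61.4  # assumed: back-calculated as 68.6 - 7.2, not reported directly

b_day = bias_reduction(p_s, p_day)
r_day = bias_ratio(b_day, se_s)
print(f"bias reduction = {b_day:.1f}, bias ratio = {r_day:.2f}")
```

A ratio well above 1, as here, means the estimated bias reduction is large relative to the sampling error of the standard estimate.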


4.3 Mean Square Error 


Since the standard estimates are thought to be biased, 
error should be measured with mean square error rather 
than variance. The MSE of the standard estimate is 
approximated by 


$$ \mathrm{mse}_i(\hat{p}_s) = \mathrm{var}(\hat{p}_s) + b_i^2 \qquad (6) $$


for each adjustment scheme. Recall that we are assuming
the adjusted estimates are unbiased, so that the mean square 
errors of these estimates are equal to their variances. The 
variance of the adjusted estimates can be approximated by 
two methods. The first approximation is obtained by 
multiplying the VIF ratio in Table 3 by the variance of the 
standard estimate. Alternatively, we can use the variance of 
the adjusted estimate obtained from replication methods. 

The error of the adjusted estimate is compared to the 
error of the standard estimate in the mean square ratio 
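(MSR). Using the VIF variance, the estimated MSR is
given by

$$ \mathrm{msr}_{\mathrm{VIF},i}(\hat{p}) = \frac{100 \times \mathrm{VIF\ Ratio}_i \times \mathrm{var}(\hat{p}_s)}{\mathrm{mse}_i(\hat{p}_s)} \qquad (7a) $$

For the replication variance, the estimated MSR is given by

$$ \mathrm{msr}_{\mathrm{var},i}(\hat{p}) = \frac{100 \times \mathrm{var}_r(\hat{p}_i)}{\mathrm{mse}_i(\hat{p}_s)} \qquad (7b) $$

where $\mathrm{var}_r(\hat{p}_i)$ is the estimated variance of the adjusted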
estimate, obtained through replication. An MSR of 100 
indicates that the variance of the adjusted estimate is 
exactly equal to the mean squared error of the standard 
estimate. An MSR above 100 means the variance of the 
adjusted estimate is larger than the MSE of the standard 
estimate, and the bias/variance trade-off for the scheme is 
not favorable. An MSR below 100 means that the adjusted 
estimate is an improvement over the standard estimate in 
terms of overall error. 
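
The calculation in equations (6), (7a) and (7b) can be sketched as follows. The function names are ours. The inputs are the published, rounded values for owns home in non-Appalachian Kentucky under the DAY scheme: the standard error and bias reduction from Table 4 and the VIF ratio from Table 3.

```python
def mse_standard(var_standard: float, b: float) -> float:
    """Equation (6): approximate MSE of the standard estimate."""
    return var_standard + b ** 2


def msr_vif(vif_ratio: float, var_standard: float, b: float) -> float:
    """Equation (7a): MSR using the VIF-based variance of the adjusted estimate."""
    return 100.0 * vif_ratio * var_standard / mse_standard(var_standard, b)


def msr_replication(var_adjusted: float, var_standard: float, b: float) -> float:
    """Equation (7b): MSR using a replication variance of the adjusted estimate."""
    return 100.0 * var_adjusted / mse_standard(var_standard, b)


se_s = 3.1         # standard error of the standard estimate (Table 4)
b_day = 7.2        # estimated bias reduction for DAY (Table 4)
vif_ratio = 4.151  # DAY VIF ratio for this region (Table 3)

var_s = se_s ** 2
msr = msr_vif(vif_ratio, var_s, b_day)
print(f"mse(standard) = {mse_standard(var_s, b_day):.2f}, MSR (VIF) = {msr:.1f}")
```

With these rounded inputs the VIF-based MSR comes out near 65, close to the 63.9 that Table 5 appears to report for this cell; a gap of that size is consistent with rounding in the published inputs.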


Table 5 gives estimated MSR values for selected survey 
variables from the Appalachian Poll, and a summary of 
these values for seventeen variables from each adjustment 
scheme. The MSR estimates vary between regions and 
between schemes. The msr values computed using the two 
different variances also differ, but the summary values are 
similar for both variances. The DAY scheme has the 
highest msr values, indicating that this weight adjustment is 
not worthwhile because it increases the variance too much. 
TRAK and AUGP have the lowest mean and median msr 
values, though these schemes produced unfavorable esti- 
mates for a few characteristics as indicated by the high 
maximum msr values. The weighting class adjustment 
schemes BWKE and BWKT performed well and their 
maximum estimated mean square ratio values are fairly low. 
All of the msr values for the PROP scheme are near 100, 
suggesting that the overall error in estimates computed with 
this scheme is comparable to the error in the standard 
estimates. 


5. CONCLUSIONS 


While telephone use is commonplace, telephone surveys 
will always contain some bias since nontelephone house- 
holds are excluded from the sampling frame, and the non- 
telephone population has characteristics that differ from 
those of the telephone population. Coverage bias is 
alleviated by poststratification on variables such as income 
and education and may not be a problem in some instances. 
However, for surveys that target poor or rural areas where 
telephone penetration rates are lower, the coverage bias is 
a large concern. 

We have proposed a few new methods for reducing the 
coverage bias by adjusting the weights of respondents in the 
transient population. We compared the resulting estimates 
to those from other existing methods. In the analysis of 
these methods, it was assumed that the adjusted estimates 
are unbiased. In the absence of unbiased benchmark esti- 
mates this assumption cannot be validated. The mean 
square ratios presented here are likely to be biased down- 
ward since the bias of the adjusted estimate is not included. 
The estimated MSR is still useful for comparing methods, 
however, and gives a good measure of the effectiveness of 
the weight adjustments. 

As anticipated, the DAY method was found to have too 
much variability to be useful. The day group (DAYG) 
method appears to perform better, but most of the mean 
square ratios for this scheme are close to 100, meaning that 
we do not see a large improvement over the standard esti- 
mate. The advantage of this scheme lies in its simplicity. 
The weight adjustment is easy to apply and does not require 
auxiliary data. 
Table 5
Mean Square Ratio for Selected Characteristics


                  VIF Mean Square Ratio                  Variance Mean Square Ratio
Characteristic    DAY DAYG BWKE BWKT TRAK PROP AUGP      DAY DAYG BWKE BWKT TRAK PROP AUGP
Owns Home 
Non-Appalachian Ohio 96.1 OT) 2a 88.1 igesy — WE)s5 84.5 Cree RES RSPR RR TES as RO Tk) 
Appalachian Ohio 423 994 8s 10819) Oo ee Oe 52:5 Mdey 96:5. 896 71.1 S16 99:2, 466 
Non-Appalachian Kentucky 63.9 976 945 77.7 Sal 99.4 83.5 21.9 98.2 CYR SE S13 oe 95.5) cel 
Appalachian Kentucky 89:33" “96 10237 ~ 770) 1106 100:3> 1124 116.0 100.7 104.1 88.4 119.2 100.0 118.3 
Non-Appalachian West Virginia NG GEA 5H VE) ABT REE TS YS) 28.6 81.1 944 714 85.5 100.1 84.2 
Appalachian West Virginia 20.2 98.4 103.2 100.3 109.3 1005 110.6 43.5 106.0 101.1 1048 1088 99.1 108.9 
No Health Insurance 
Non-Appalachian Ohio 9919 99/0 T8522 5 Olk4 wae 5:9) 809 Seer 85) 98.8 100.5 12a SAON:7 S20) S1OL98 W76n 
Appalachian Ohio 126:8) 7 101-35 102.19 106.5) 5125.1 99:9 23.5 9234. 98:8 95.6 9 194.0)9105:8> 9910) 9 100:4 
Non-Appalachian Kentucky 206.4 99.9 100.9 102.8 103.0 99.8 102.7 39.0 87.9 90.7 $6.2) 0 9728) 010m 95. 
Appalachian Kentucky 82.7 106.7 104.4 103.7 102.1 STD 84.3 33-5 109:9 104.9 1053 1141 T0007 1005 
Non-Appalachian West Virginia 100.2 97.1 SOG 840F T7951 STO Tes PSGtOR a oz 940 89.7 906 100.8 84.3 
Appalachian West Virginia 149:6,7" 991274 83:56 52:8) EOS a AGH 107.0 96.5 TSA 84.3 SSM 96u7 45:4. 
Not enough Money for Food 
Non-Appalachian Ohio 86:57 6905 SOS S79 8452 9987.) 45:6 105.2 100.8 104.3 94.5 66:3 )y102'0 5 67:0 
Appalachian Ohio 31:9 92,9) 9 STAM eSorl 486 994 464 68.5 98.2 96:8 90!2, 69/4101 66.4 
Non-Appalachian Kentucky 139.1 94.1 18:2, 69:5 69.5 96.7 64.3 320.7 96.8 91.9 85.7 77.6 100.1 68.5 
Appalachian Kentucky 223 55.85 SSOmeeo Ss, 310) 97:2, 431.9 30.5 68.4 68.5 69.5 36.8. 100.2, } 38.5 
Non-Appalachian West Virginia 117.3 102.6 82742 70 59:65 97 Sie e 105.7 101.9 94.7 8873) 7S | JOG. 68:5: 
Appalachian West Virginia 1816 97.0 84.1 88.5 504 94.1 39.9 922 98:8 the ote PLPC ve ON ACHES, eyes} 
Computer in Home 
Non-Appalachian Ohio 98.1 98.5 964 88.2 88.1 99.8 86.1 99°35; 995 102.0 102.1 106.3 100.6 102.8 
Appalachian Ohio 127.2 POLO L Ze S103 30 9 OTF 96:2: 1 9957) 9255 116:0: | 99:6") 101239 965059441 99.1 86.5 
Non-Appalachian Kentucky Gis “ROSS Ws Are 92 Tat 93:6 BEDS eee92/8 27.1 93.7 915 S97, 111903 1982. 884 
Appalachian Kentucky 147.1 S90 O17, 85.1 55.7 100.3 68.4 S89) Sila) 855 79.5 468 1009 66.6 
Non-Appalachian West Virginia 66.8 96.9 86.6 85.8 76.1 98:5. _ 13:5 59:65, 95:8 85.1 S5s3.see 2-O) oS OmELLOSIO 
Appalachian West Virginia 82.7 95.6 1044 103.0 111.2 99.6 108.2 41.8 88.1 LONG 9919 IS Sos See 0 710) 
Summary of Seventeen Variables 
Mean 1376 | 975 94.3 92.2 S512, 9933 83.8 12522 0) oR 97.1 96:3: 96:0 10025 5593'S 
Median LOTS 299 '0Ne 994 OTA 89.8 99.8 86.3 ee B50) 987 985 98.0" 100:0° 92.4 
Minimum 10.9 55.8 O09 57-9 4.1 94.1 Sif 7.0 68.4 43.1 62.1 76 94.6 6.0 
Maximum 607.7 108.5 104.8 109.1 133.1 100.5 133.5 695.2 140.8 144.5 147.5 593.8 116.7 545.4 
Percent below 100 47.1 60:8... 61:8 58:8 65.7 87.3 67.6 632g 6277; 56:9 s5S:8y 53:95 SSS BS88 


Note: In addition to the four proportions listed in the table, the summary of seventeen variables includes worry about income, better off 
economically in the 1990's, dissatisfied with own net worth, married, have children, unemployed, college graduate, in good or excellent health, 
serious illness in household, no family doctor, satisfied with own housing, very safe drinking water, and internet access in home. 


The weighting class adjustment schemes have the benefit 
of giving more weight to respondents in cells where the 
likelihood of having a phone is lower. For these schemes, 
greater bias reduction was seen in variables correlated with 
the classification variables. For example, home ownership 
and computer ownership are positively correlated, and the 
BWKT scheme, which classified respondents by home 
ownership, produced estimates of the percent of households 
with a home computer that were consistently lower than the 
standard estimates. Table 5 shows that the BWKE and 
BWKT schemes produce an improved estimate most of the 
time. It should also be noted that when these schemes
produce an estimate that is not an improvement, the increase
in variance remains fairly small. The weighting class adjust-
ment method works well for samples of large populations,
such as states or countries, since the outside data needed to
compute the adjustments is readily available. The method
is more difficult to use for very specific samples such as
counties.

The raking ratio adjustment, TRAK, produced a number 
of very favorable estimated MSR values. With this scheme 
we were able to account for the difference in telephone 
penetration rates by region, but not the differences across 
other demographic characteristics. Variability was intro-
duced when we estimated the regional rates from the state
rates; thus, as with the weighting class adjustment, the
scheme works better for samples of larger populations.
While the mean and median estimated MSR values were 
low for this scheme, the scheme also produced some high 
mean square ratios. The higher ratios occurred in Ohio 
where the percent of transients in the sample was low 
compared to the estimated percent without continuous 
service. 
The propensity adjustment alone, PROP, provided too
little reduction in bias to be worthwhile. The propensity
adjustment is advantageous, however, because it allows us
to account for differences in the likelihood of having tele-
phone service without using outside data. When used in
conjunction with raking, the propensity-based scheme
AUGP produced good results.

There are many issues to consider when determining 
which adjustment scheme is preferred. As mentioned 
previously, the weighting class adjustment schemes BWKE 
and BWKT are difficult to implement if you have a very 
specific target population. These schemes are fairly conser- 
vative, however, in that they typically reduce the bias 
without increasing the variance. The schemes that 
employed raking usually performed better than the 
weighting class adjustment schemes, but the larger weight 
adjustments sometimes led to increased variances. It may be 
advisable to compute estimates using several schemes and 
then determine which scheme offers the best bias-variance 
trade-off. 

Brick et al. (1996) note that these weight adjustments for 
telephone coverage should be more beneficial in reducing 
mean squared error when the sample size of the survey is 
large. As the sample size increases, the bias ratio increases 
since the bias is unaffected but the standard error of the 
estimate, which is in the denominator, decreases. 
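
The sample size effect can be illustrated with a short sketch. It assumes, as a simplification not stated in the source, that the standard error shrinks in proportion to 1/sqrt(n) while the bias reduction stays fixed. The baseline numbers are hypothetical.

```python
import math


def scaled_bias_ratio(b: float, se_base: float, n_base: int, n_new: int) -> float:
    """Bias ratio at sample size n_new, assuming se scales as 1/sqrt(n) and b is fixed."""
    se_new = se_base * math.sqrt(n_base / n_new)
    return b / se_new


# Hypothetical baseline: bias reduction of 1.0 point, standard error 2.0 at n = 500.
for n in (500, 2000, 8000):
    print(n, round(scaled_bias_ratio(1.0, 2.0, 500, n), 2))
```

Under this assumption, quadrupling the sample size doubles the bias ratio, so an adjustment that looks negligible in a small survey can matter in a large one.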

The findings of this study and others indicate
that the adjustments could be useful for many estimates 
from telephone surveys and should be seriously considered. 
The benefits of adjustment appear to outweigh the penalties 
in the weighting class adjustment schemes, the raking 
scheme, and the augmented propensity scheme. In light of 
the smaller sample size and special target population of the 
Appalachian Poll, generalizations of these findings should 
not be made until the methods receive further evaluation. 
These weight adjustments still need to be tested using a 
survey that is free of coverage bias, one that includes 
nontelephone households in the sampling frame and 
collects information on telephone status, in order to assess 
the validity of the assumptions. Data from the National 
Survey of America’s Families or the National Health
Interview Survey may be appropriate for evaluating the 
adjustment methods and the assumptions. 
APPENDIX
Logistic Regression of Transient Status


Below is our model for predicting transient status. Most
of the variables in the model relate to socioeconomic status.
The coefficients indicate that young people, those with low
income, those who are not employed full-time, American
Indians and African Americans, and residents of distressed
counties have higher propensities for transience. The large
p-value of the Hosmer and Lemeshow test (0.894) gives no
evidence of lack of fit. The area under the ROC curve (0.782)
indicates that the model discriminates well.


Variable Coding 
Age 
0 - “Refused” (Count = 9) 
1 - 18 to 29 years 
2 - 30 to 44 years 
3 - 45 to 59 years 
4 - over 60 
Low Income 
0 - Household income over $20,000 or refused
1 - Household income under $20,000 
Employment Status 
0 - Employed full-time or retired 
1 - Other (refused, part-time, housekeeper, student, 
unemployed, other) 
Race 
0 - Caucasian, Alaskan Native, Hispanic, or Asian 
1 - American Indian, African-American, Black, or
other 
Appalachian 
0 - Does not live in a distressed county of KY, OH, 
or WV 
1 - Lives in a distressed county 
Kentucky/West Virginia 
0 - Ohio 
1 - Kentucky or West Virginia 


Results 
Variables in the Equation 

Variable B S.E. 
Age (Refused) -2.107 12.160 
Age (18-29) 2.006 0.357 
Age (30-44) 1.664 0.347 
Age (45-59) 1.064 0.364 
Low Income 1.358 0.189 
Employment Status 0.397 0.187 
Race 1.136 0.292 
Appalachian 0.531 0.196 
KY/WV 0.567 0.216 
Constant -5.712 0.401 

Hosmer and Lemeshow Goodness of Fit Test 
Chi-Square 3.568 
Degrees of Freedom 8 
p-value 0.894 

ROC Curve 

Area under the Curve 0.782 
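
To show how the fitted coefficients translate into a predicted propensity, the following sketch applies the inverse logit to the values in the "Variables in the Equation" table. The dictionary keys and function name are ours. Treating "over 60" as the age reference category (coefficient 0) is an assumption, since the table lists no coefficient for it.

```python
import math

# Coefficients (column B) from the "Variables in the Equation" table.
# Assumption: age over 60 is the reference category, so it has no entry.
COEF = {
    "const": -5.712,
    "age_refused": -2.107,
    "age_18_29": 2.006,
    "age_30_44": 1.664,
    "age_45_59": 1.064,
    "low_income": 1.358,
    "employment_other": 0.397,
    "race": 1.136,
    "appalachian": 0.531,
    "ky_wv": 0.567,
}


def transient_propensity(**indicators: int) -> float:
    """Predicted probability of transient status for the given 0/1 indicators."""
    eta = COEF["const"]
    for name, value in indicators.items():
        eta += COEF[name] * value
    return 1.0 / (1.0 + math.exp(-eta))


# Reference respondent: every indicator equal to zero.
p_reference = transient_propensity()

# Respondent aged 18-29 with low income in a distressed county of KY or WV.
p_high = transient_propensity(age_18_29=1, low_income=1, appalachian=1, ky_wv=1)

print(round(p_reference, 4), round(p_high, 4))
```

For the second respondent the linear predictor is -1.25, so the predicted propensity is a little over 0.22, compared with well under 0.01 for the reference respondent.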
The Effect of Intensity of Effort to Reach Survey Respondents:
A Toronto Smoking Survey


LOUIS T. MARIANO and JOSEPH B. KADANE¹


ABSTRACT 


The number of calls in a telephone survey is used as an indicator of how difficult an intended respondent is to reach. This 
permits a probabilistic division of the non-respondents into non-susceptibles (those who will always refuse to respond), and 
the susceptible non-respondents (those who were not available to respond) in a model of the non-response. Further, it 
permits stochastic estimation of the views of the latter group and an evaluation of whether the non-response is ignorable 
for inference about the dependent variable. These ideas are implemented on the data from a survey in Metropolitan Toronto
of attitudes toward smoking in the workplace. Using a Bayesian model, the posterior distribution of the model parameters 
is sampled by Markov Chain Monte Carlo methods. The results reveal that the non-response is not ignorable and those who 
do not respond are twice as likely to favor unrestricted smoking in the workplace as are those who do. 


KEY WORDS: Call-backs, number of; Bayesian analysis; Markov Chain Monte Carlo method; Informative non-response;
Ignorable non-response.


1. INTRODUCTION 


Given the reality of non-response in every survey, it is of 
interest to determine how to account for this non-response 
in the interpretation of the collected data. Rubin (1976)
gives necessary and sufficient conditions for an analysis
that ignores the missingness mechanism to be identical, from
frequentist, likelihood, and Bayesian perspectives respectively,
to an analysis based on a model incorporating that mechanism.
Building on this, Little and Rubin (1987) led to an extensive 
literature modeling non-response in an informative, non- 
ignorable way. 

Information about the interaction between the survey and 
the surveyed can sharpen the analysis of the import of 
missing data in a survey. The example in this paper 
concerns the attitudes of Toronto citizens about smoking in 
the workplace. Random telephone numbers were chosen; at 
least twelve calls were made to try to reach the intended 
respondents. Our data for the respondents includes only the 
number of calls until the survey was completed, not the 
timing of the unsuccessful calls. With even this attenuated 
data on how difficult the respondent was to reach, we find 
our view of the results of the survey to be importantly 
informed by the number of unsuccessful calls. 

The use of information on the number of calls to a 
subject chosen to participate in a survey is not unique. 
Potthoff, Manton and Woodbury (1993) present a method 
for correcting for survey bias due to non-availability by 
weighting based on the number of call-backs. While our 
analysis also focuses on the bias due to non-availability, 
there are major differences. Instead of assuming that 
refusals do not exist, we allow for and utilize their potential 
existence in modeling the mechanism which causes non-response.
In the analysis that follows, the relationship of
non-response to the response variable of interest in the 
survey is evaluated along with other explanatory variables, 
after weighting for both household size and the appropriate 
population demographics. In doing so we address not only 
whether error exists due to non-availability, but also 
whether stratification of the respondents by household size 
and the then current age/sex distribution may eliminate the 
necessity for accounting for the error by the introduction of 
a mechanism which describes the non-response. Note that 
here we match the groupings of Pederson, Bull and Ashley 
(1996) used in the original published analyses of the 
dataset; more complex cell adjustment procedures are 
possible (e.g., Little 1996; Eltinge and Yansaneh 1997, and 
references cited therein). 

The remainder of this article is organized as follows:
Section 2 gives more detail on the survey; Section 3
introduces the methodology employed; Sections 4 and 5
respectively explore missing-at-random and non-ignorably-
missing models; Section 6 discusses the prior distributions
chosen for the main analysis, whose results are explained in
Section 7. Finally, Section 8 gives our conclusions.


2. THE SURVEY 


A bylaw regulating smoking in the workplace in the City 
of Toronto took effect on March 1, 1988. From January 
1988 to the present, a series of six surveys has been
conducted to assess attitudes of the public toward smoking, 
awareness of health risks related to smoking, and the impact 
of the law on the residents of Metropolitan Toronto. The 
data being utilized in this analysis comprises the third phase 


¹ Louis T. Mariano is a Ph.D. candidate, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213; Joseph B. Kadane is Leonard J. Savage
University Professor of Statistics and Social Sciences, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213.

of this series. Northrup (1993) provides the technical
documentation for this survey. For clarity, when necessary, 
the data being analyzed here is referred to as the Phase III 
data, and information from the first two surveys is referred 
to as the Phase I & II data. 

Northrup (1993) indicates that the data of interest, which 
were made available by the Institute for Social Research 
(ISR) at York University, were collected from 1,429 
residents of the Metropolitan Toronto area in December 
1992 and March 1993. A two-stage probability selection 
process was utilized to select survey respondents. The first 
stage employed random digit dialing. The second stage 
used the most recent birthday method to select one adult 
individual once an eligible residence was reached. The 
responses were then weighted by the number of adults in 
the household. In the analysis that follows, post-stratifi- 
cation weighting was also applied to the census age-sex 
distribution to adjust for the underrepresentation of some 
population subgroups. The number of distinct phone lines 
in the household was not taken into consideration during 
the data collection. 
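
The two weighting stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by ISR: the respondents, the age-sex cells and the census shares are all hypothetical.

```python
from collections import defaultdict

# Hypothetical respondents: (number of adults in household, age-sex cell).
respondents = [
    (1, "F18-44"), (2, "F18-44"), (2, "M18-44"),
    (3, "M45+"), (1, "F45+"), (2, "F45+"),
]
# Hypothetical census shares for the same cells (they sum to 1).
census_share = {"F18-44": 0.27, "M18-44": 0.26, "F45+": 0.25, "M45+": 0.22}

# Stage 1: household-size weight, since one adult is selected per household.
base = [adults for adults, _ in respondents]

# Stage 2: rescale each cell so its weighted share matches the census share.
cell_total = defaultdict(float)
for w, (_, cell) in zip(base, respondents):
    cell_total[cell] += w
grand_total = sum(base)

weights = [
    w * census_share[cell] / (cell_total[cell] / grand_total)
    for w, (_, cell) in zip(base, respondents)
]
print([round(w, 3) for w in weights])
```

After the second stage the weighted share of every cell equals its census share, while the overall sum of weights is unchanged.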

The number of calls it took to reach each respondent is 
included as a variable in the dataset, and there are no 
missing values for this variable. Northrup (1993) explains 
that the 1,429 responses came from a sample of 5,702 
telephone numbers generated by the random digit dialing 
method. Of these numbers, 2,286 were verified to be 
eligible households, and 3,150 of the numbers in the sample 
were not eligible. The status of the remaining 266 numbers
could not be determined. It has been assumed by ISR
that the household eligibility rate of these 266 numbers was 
equal to the rate for the rest of the sample. This eligibility 
rate implies an estimated total of 2,398 households in the 
sample and a response rate of 60%. Thus, an estimated 969 
subjects chosen to participate in the survey did not respond. 
Each subject received a minimum of 12 calls, including 
day, night, and weekend calls, before being classified as a
non-respondent. 
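
The arithmetic behind these figures can be reproduced directly from the counts reported above. The variable names are ours, and rounding the estimated number of eligible unresolved numbers to a whole household is our assumption.

```python
sampled = 5702
eligible = 2286
ineligible = 3150
completed = 1429

# Numbers whose eligibility could not be determined.
unresolved = sampled - eligible - ineligible

# ISR assumption: unresolved numbers are eligible at the same rate as resolved ones.
eligibility_rate = eligible / (eligible + ineligible)
est_households = eligible + round(unresolved * eligibility_rate)

response_rate = completed / est_households
est_nonrespondents = est_households - completed

print(unresolved, est_households, f"{response_rate:.1%}", est_nonrespondents)
```

This reproduces the 266 unresolved numbers, the estimated 2,398 households, a response rate of about 60%, and the 969 estimated non-respondents.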

The dependent variable, for the purpose of this analysis, 
is an individual’s opinion on the regulation of smoking in 
the workplace, in one of three categories. Category “0” 
indicates smoking should be permitted in restricted areas 
only, category “1” indicates smoking should not be 
permitted at all, and category “2” indicates smoking should
not be restricted at all. For each subject chosen to participate
in the survey, let $Y_i \in \{0, 1, 2\}$ represent the opinion
of subject $i$.

The data comprise the answers to 50 survey
questions as well as 18 other variables identifying charac- 
teristics of the subject. Included in these are: 


— “K-risk” is an integer score from 0 to 12 which 
indicates knowledge of the risks and effects of 
second-hand smoke. 


— “Smoker” indicates the smoking status of the 
subject: “Current smoker” (S), “Former smoker” 
(SQ) or, “Never smoked” (NS). 


— “Bother” indicates if second-hand smoke bothers 
the subject: “Always bothers” (b.A), “Usually 
bothers” (b.USUL), or “Does not bother” (b.NO).


— “Age”: (Age in years - 50) / 10. 


Pederson, Bull, Ashley and Lefcoe (1989) created a 
“Knowledge of health effects score” on passive smoking 
out of the answers to six survey questions, which measured 
a subject’s knowledge of the effects of second-hand smoke. 
Pederson et al.’s questions were used in Phase III to create 
their score, here renamed “K-risk”. A higher K-risk score 
indicates a greater knowledge of the risks of second-hand 
smoke. The variable “Age” was shifted and rescaled to 
match how age was treated by Bull (1994) in the Phase I & 
II analysis. 


3. OVERVIEW OF METHODOLOGY 


The fundamental question of interest is: “May we ignore 
the unit non-response and treat the observed data as a 
random subsample of the population?” Mapping to the 
terminology of Little and Rubin (1987) and Rubin (1976): 
If we may treat the observed data for the dependent variable 
of interest as a random subsample, we call the missing data 
“missing completely at random” (MCAR). If we may treat 
the observed data for the dependent variable of interest as 
a random subsample, after conditioning on the explanatory 
variables, we call the missing data “missing at random” 
(MAR). Let $\theta$ represent the parameters of the data and let $\pi$
represent the parameters describing the missing data
process. Rubin (1976) calls the parameters $\pi$ and $\theta$ distinct
“if there are no a priori ties, via parameter space restrictions
or prior distributions, between $\pi$ and $\theta$.” If either the
MCAR or MAR cases apply and if $\pi$ and $\theta$ are distinct, the
mechanism which causes the missing data is said to be 
“ignorable” for inference about the distribution of the 
variable of interest. If the missing data for the dependent 
variable of interest is dependent on the values of that data, 
then the mechanism which causes the missing data is said 
to be “non-ignorable” (NI). Groves and Couper (1998) note 
that when the likelihood of participation is a function of the 
desired response variable, the non-response bias can be 
relatively high, even with a good response rate. 

Let $R_i$ be an indicator of response,
$R_i = I_{(\mathrm{respondent})}(\mathrm{subject}\ i)$, and $R = (R_1, \ldots, R_n)'$. Little and
Rubin (1987) suggest that one possible method for
accounting for the non-response mechanism is to include
this response indicator variable in the model. We may call
the mechanism which causes the missing data ignorable if $\pi$
and $\theta$ are distinct and:

$$ f(R \mid Y_{obs}, Y_{mis}, \pi) = f(R \mid Y_{obs}, \pi) \qquad (1) $$

where $Y_{obs}$ and $Y_{mis}$ represent the observed and missing
portions of the dependent variable of interest.

The terms “MAR assumption” and “NI assumption” will 
be used throughout this analysis. For clarity, the term 
“MAR assumption” is defined as the assumption that the 
missing data mechanism is ignorable for inference with 
respect to the dependent variable identified in section 2. 
That is, the observed values of that variable are a random 
subsample of the population, possibly within poststrata, and 
it is not necessary to account for the missing data mecha- 
nism. The term “NI assumption” is defined as the assump- 
tion that the missing data mechanism is non-ignorable and 
the data collected for the dependent variable of interest 
cannot be treated as a random subsample. Specifically, 
inference for the population must involve the missing data 
mechanism. 
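
The practical difference between the two assumptions can be seen in a small simulation. This is a hypothetical sketch, not the model used in this paper: it has a binary opinion, no covariates, and made-up response probabilities, so the ignorable case here is simply response that does not depend on the opinion.

```python
import random

random.seed(2001)
N = 200_000

# Hypothetical population: 30% hold opinion 1, the rest hold opinion 0.
y = [1 if random.random() < 0.30 else 0 for _ in range(N)]


def observed_share(response_prob):
    """Share with y == 1 among respondents; response_prob maps y to P(respond)."""
    responders = [yi for yi in y if random.random() < response_prob(yi)]
    return sum(responders) / len(responders)


# Ignorable case: the chance of responding does not depend on y.
share_ignorable = observed_share(lambda yi: 0.60)

# Non-ignorable case: subjects with y == 1 respond half as often.
share_ni = observed_share(lambda yi: 0.35 if yi == 1 else 0.70)

print(round(sum(y) / N, 3), round(share_ignorable, 3), round(share_ni, 3))
```

In the ignorable case the respondents reproduce the population share of about 0.30. In the non-ignorable case the observed share falls to roughly 0.18, so treating the respondents as a random subsample would understate the opinion held by the harder-to-reach group.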

The approach to assessing the MAR assumption
comprises three steps. The first step is the examination
of what one might do under the MAR assumption. Since 
the dependent variable of interest has three categories and 
some of the explanatory variables are quantitative, poly- 
tomous logistic regression is employed. Both frequentist 
and Bayesian forms of the logistic regression model are 
examined. 

In the second step, an NI model is constructed. The
non-response mechanism is modeled utilizing the infor- 
mation available about the number of calls made to each 
subject. Here, the idea of a surviving fraction in the sample 
is examined to model whether it is actually possible to 
reach all the intended respondents. Then, the non-response 
mechanism is related to the dependent variable by including 
the number of calls in the logistic regression model. 

In the development of the NI model, we employ a 
Bayesian approach to allow for an examination of the 
values the missing data are likely to take, given the observed
data and the model parameters. This is accomplished by 
utilizing a data augmentation approach, where the missing 
data are imputed in each iteration of a Markov Chain Monte 
Carlo (MCMC) simulation. A possible alternative would be 
to utilize the expectation-maximization (EM) algorithm 
(Dempster, Laird and Rubin 1977) to compute the maxi- 
mum likelihood estimates (MLE’s) of the missing values. 

In the third step, an evaluation of the MAR assumption 
is made. Non-zero coefficients for the number of calls in 
the logistic regression portion of the NI model will imply 
that the number of calls does make a difference; i.e., the 
opinions of those who did not respond in the first 12 calls 
are likely to differ from those who responded in just a small 
number of calls. In this case, the missing data mechanism 
is not independent of the values of the missing data and an 
MAR assumption would be inappropriate. Next, the log 
odds of response among the three models are examined. 
Differences here identify the magnitude of the error that a 
faulty MAR assumption causes. So, in the evaluation of the 
MAR assumption, the questions “is there a difference?” and 
“how large is the difference?” are both addressed. 


4. MAR MODELS 


4.1 Logistic Regression 


Using the data collected from the (m = 1,429) subjects 
that did respond to the survey, weighted logistic regression 
was employed to model the public’s opinion on smoking in 
the workplace. The collection of candidate predictors found 
in the survey questions and the background information was 
narrowed utilizing a series of Wald tests. Then likelihood 
ratio tests, AIC, and BIC were used to compare the possible 
models. The model with the best fit was found to be the one 
which included additive terms for the variables “K-risk”, 
“Smoker”, “Bother”, and “Age”, as defined in section 2. 

As each of the models examined in this analysis employs 
a logistic regression component, it is useful here to illustrate 
the notation being used. Category “0”, “smoking allowed in 
restricted areas only”, was chosen to be the reference 
category. Recall $Y_i \in \{0, 1, 2\}$. For the MAR model, we use 
only the observed values of the subject’s opinion on 
workplace smoking, $Y_{obs} = (Y_1, \ldots, Y_m)$. Let $Y_{ij} = I_{\{j\}}(Y_i)$ be 
an indicator of subject $i$ responding in category $j$, and let $W_i$ 
represent the weight each subject received. As in the 
original published analyses of this dataset (Pederson et al. 
1996) both household (see Northrup 1993) and post- 
stratification (see Appendix A) weighting were used in the 
consideration of all models here. 

The two categorical explanatory variables, “Smoker” and 
“Bother”, were included in the model by utilizing indicator 
variables for two of the three categories, with the effect of 
the third category being absorbed in the intercept term. For 
“Smoker”, “$S_i$” and “$SQ_i$” were included as indicators that 
subject $i$ was either a current smoker or a smoker who had 
quit. For “Bother”, “$\text{b.USUL}_i$” and “$\text{b.NO}_i$” were included 
as indicators that second-hand smoke usually bothered or did 
not bother subject $i$. 

Let $X_i$ represent the vector of explanatory variables 
for subject $i$. Then, 


$X_i = (\text{K-risk}_i, S_i, SQ_i, \text{b.USUL}_i, \text{b.NO}_i, \text{Age}_i)$. 


Here we use an unordered multinomial logit model to 
consider $p_j(x_i) = P(Y_{ij} = 1 \mid X_i = x_i)$, the probability that 
subject $i$ responds in category $j \in \{0, 1, 2\}$, given the 
observed explanatory variables for subject $i$. This model, of 
course, utilizes linear equations $\eta_{ij}$ describing the log odds 
of subject $i$ responding in category $j$ versus the reference 
category $j = 0$. So, for $j = 1, 2$ we wish to examine: 


$\log\left( \frac{p_j(x_i)}{p_0(x_i)} \right) = \eta_{ij}$ 


with $\eta_{i0} = 0$. The two resultant linear equations, $\eta_{i1}$ and 
$\eta_{i2}$, each have seven coefficients, including an intercept 
term $\beta_{0j}$ and those displayed below: 


$\eta_{ij} = \beta_{0j} + X_i \beta_j$. (2) 


$\beta_j^{\top} = (\beta_{\text{K-risk}_j}, \beta_{S_j}, \beta_{SQ_j}, \beta_{\text{b.USUL}_j}, \beta_{\text{b.NO}_j}, \beta_{\text{Age}_j})$. 

The MAR logistic regression model has 14 parameters. 
The vector of these 14 parameters, represented by 
$\beta = (\beta_{01}, \beta_1, \beta_{02}, \beta_2)$, has the likelihood (or, more 
appropriately, pseudo-likelihood, since the weights are 
incorporated through the variable $W_i$): 


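$L(\beta) \propto \prod_{i=1}^{m} \prod_{j=0}^{2} \left[ \frac{e^{\eta_{ij}}}{1 + e^{\eta_{i1}} + e^{\eta_{i2}}} \right]^{Y_{ij} W_i}$. (3) 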
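
The notation of Equations (2) and (3) can be illustrated with a minimal sketch. It assumes category 0 is the reference with its linear predictor fixed at zero, and it works on the log scale. The function names, the toy coefficients and the toy data are hypothetical and are not taken from the survey.

```python
import math

def category_probs(x, beta):
    """Return [p_0, p_1, p_2] for one subject under the multinomial logit.

    x    : list of explanatory values for the subject
    beta : {1: (intercept, [coefs]), 2: (intercept, [coefs])}; category 0
           is the reference, so its linear predictor is fixed at 0.
    """
    eta = [0.0]
    for j in (1, 2):
        b0, b = beta[j]
        eta.append(b0 + sum(xk * bk for xk, bk in zip(x, b)))
    top = max(eta)  # subtract the maximum for numerical stability
    expo = [math.exp(e - top) for e in eta]
    total = sum(expo)
    return [e / total for e in expo]

def pseudo_loglik(X, y, w, beta):
    """Weighted log pseudo-likelihood: sum over i of W_i * log p_{y_i}(x_i)."""
    return sum(wi * math.log(category_probs(xi, beta)[yi])
               for xi, yi, wi in zip(X, y, w))

# Hypothetical toy inputs: two covariates, three subjects.
beta = {1: (0.2, [0.5, -0.1]), 2: (-1.0, [0.0, 0.3])}
X = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]]
y = [0, 1, 2]          # observed category for each subject
w = [1.0, 0.8, 1.2]    # survey weights W_i

print(category_probs(X[0], beta))
print(pseudo_loglik(X, y, w, beta))
```

Because $Y_{ij}$ is an indicator, the double product in Equation (3) reduces on the log scale to a single weighted sum over subjects, which is what `pseudo_loglik` computes.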


4.2 Bayesian Logistic Regression 


The likelihood in equation (3) and the data collected 
from the survey respondents are utilized in the Bayesian 
analysis. The same four explanatory variables selected in 
the frequentist analysis above are used as the explanatory 
variables here. Prior distributions, discussed in section 6, 
were assigned to the logistic regression parameters. An 
MCMC simulation is utilized in order to draw from the 
posterior distribution of the parameters. 


5. NI MODEL 


5.1 Modeling the Non-Response Mechanism 


Since the missing values are not necessarily missing at 
random, the mechanism which caused them to be missing 
must be addressed. Northrup (1993) indicates that non- 
respondent subjects chosen to participate in the survey were 
called a minimum of 12 times, including a minimum of 
three day, four evening and four weekend calls. Unfortu- 
nately, other useful information regarding the number of 
calls was not retained. We do not know which of the non- 
respondents were called more than twelve times or whether 
an individual call was placed during the day, evening, or 
weekend. We also are unaware of the details of the non- 
response, such as whether the subject was contacted but 
refused to participate, whether the calls were ever answered 
by a machine, or whether they were answered at all. Thus, 
stratification of the non-respondents was not possible, and 
they were all treated as exchangeable in this analysis. 

Each subject was called a number of times until the 
survey was successfully completed or they were classified 
as non-respondent. For the respondents, the number of calls 
variable ($C_i$) describes the number of trials until the first 
success for subject $i$. Thus, one might expect the number of 
calls to follow a Geometric distribution with truncated 
observations for the non-respondents. Specifically, let 
$\pi = P(\text{a call is successful})$; then consider 
$C_i \sim \text{Geometric}(\pi)$ and $P(C_i = c_i) = \pi (1 - \pi)^{c_i - 1}$. Note that 
if auxiliary information about the number of calls to the 
non-respondents were available (e.g., Groves and Couper 
1998), we could have also considered conditional response 
probabilities here. 

The histograms in Figure 1 compare the data (through 
the first twelve calls) to a Geometric distribution with 
parameter $\pi = .225$, which appears to match fairly well. The 
sample order statistics suggest $\pi \in (.2, .25)$. The histogram 
of the actual survey data reveals that the number of subjects 
reached on the first call is smaller than the number reached 
on the second call. It is possible that more of the second 
calls were placed at a time which had a higher success rate. 

Suppose $\pi = .225$; by the memoryless property of the 
Geometric distribution, we would expect 218 of the 969 
non-respondents to reply on the 13th call. This would make 
the data through the first 13 calls appear as in Figure 2. 
Clearly, Figure 2 does not display the behavior of a 
Geometric random variable. Consider the following 
question: “If all subjects were called an unlimited number 
of times, would they all have been reached?” Answering 
“yes” to that question for this dataset results in the problem 
illustrated in Figure 2. 
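
The arithmetic behind this argument can be sketched as follows, using only the counts quoted in the text (2,398 sampled subjects, 969 non-respondents, a success probability of .225 and 12 calls). The function names are hypothetical.

```python
def geometric_pmf(c, pi):
    """P(C = c) for c = 1, 2, ...: the first success occurs on call c."""
    return pi * (1.0 - pi) ** (c - 1)

def expected_successes_next_call(n_remaining, pi):
    """Memoryless property: every subject still unreached succeeds
    on the next call with the same probability pi."""
    return n_remaining * pi

n_total = 2398      # subjects chosen for the survey
n_nonresp = 969     # subjects not reached within 12 calls
pi = 0.225          # success probability considered in the text

# Expected number of successes on call 13 if all 969 were reachable.
expected_13th = expected_successes_next_call(n_nonresp, pi)
print(round(expected_13th))     # 218

# Under a pure Geometric model, the expected number still unreached
# after 12 calls is n * (1 - pi)**12, far fewer than the 969 observed.
expected_unreached = n_total * (1.0 - pi) ** 12
print(expected_unreached)
```

The second quantity comes to roughly 113 subjects. The gap between that figure and the 969 observed non-respondents is another way of seeing why a single Geometric distribution for all subjects is hard to sustain.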


[Figure 1: histograms of successful calls by attempt; panel label: Expected Under Geometric (.225); horizontal axis: Number of Call Attempts] 

Figure 1. Comparison of the actual survey data for successful calls in the first 12 attempts to expected results based 
on a Geometric (.225) distribution for the number of calls needed to complete the survey. 

Figure 2. Display of the actual number of successful calls on each attempt through the first 12 and the 
expected number of successful calls on the 13th attempt. The expectation for the 13th call is based 
on a Geometric (.225) distribution to model the number of calls until the survey is completed. 


Given the information outlined above, the assertion that 
“not all subjects chosen for the survey are reachable” is a 
viable one. Maller and Zhou (1996) discuss immune 
subjects — individuals who are not subject to the event of 
interest. Following their terminology, if it is not possible to 
procure a response from a subject chosen for the survey 
given an unlimited number of calls, that subject is 
categorized as immune. Subjects who are not immune are 
categorized as “susceptible”. The set of immune (i.e., 
non-susceptible) subjects comprises the “surviving fraction” of 
the sample. Mapping to more familiar terminology, the 
immune subjects include those who were reached and 
refused, those who would have refused if they had been 
reached, and those cases of a physical or mental inability to 
ever participate. Northrup (1993) indicates that those who 
initially refused to participate were subsequently contacted 
by the most senior interviewers, so we make the assumption 
here that all remaining refusals would never participate. 
The susceptible group includes the respondents, 
those who would have responded if successfully contacted, 
and those who were physically or mentally unable to parti- 
cipate during the data collection period but were willing 
and able at some other time. 

Let the variable $Z_i = I_{\{\text{susceptible}\}}(\text{subject } i)$ be an indicator 
of the susceptibility of subject $i$, and 
$p = P(\text{subject } i \text{ is susceptible})$, i.e., $Z_i \sim \text{Bernoulli}(p)$. 
Now suppose that the number of calls to the susceptible 
subjects follows a Geometric distribution, i.e., 
$C_i \mid Z_i = 1 \sim \text{Geometric}(\pi)$. Does this eliminate the problem 
illustrated in Figure 2? 

Let $R_i$ be an indicator of response of subject $i$. The 
non-response mechanism can be accounted for by including 
these response indicators in the model. However, the 
introduction of the susceptibility variable implies two distinct 
classes of non-response. So, it is possible to be more 
detailed and use both the susceptibility $Z = (Z_1, \ldots, Z_n)$ and 
the response $R$ indicators in a mixture model describing the 
non-response. Updating Equation (1), the missing data 
mechanism is ignorable if and only if $(\pi, p)$ is distinct from $\theta$ 
and 


$f(R, Z \mid Y_{obs}, Y_{mis}, \pi, p) = f(R, Z \mid Y_{obs}, \pi, p)$. (4) 


Let $C_{obs} = (C_1, \ldots, C_m)$ and $Z_{obs} = (Z_1, \ldots, Z_m)$ be the 
vectors of the number of calls and the observed susceptibility 
for each respondent. Also, let $R = (R_1, \ldots, R_n)$ be 
the vector of response for each intended respondent. Every 
subject, $i$, may be classified by response into three mutually 
exclusive groups, $A_{obs}$ (observed), $A_{mis}$ (missing), and 
$A_{imm}$ (immune), where: 


$A_{obs} = \{i : i \text{ was Susceptible and Responded}\}$ 
$A_{mis} = \{i : i \text{ was Susceptible but did not Respond in 12 calls}\}$ 
$A_{imm} = \{i : i \text{ was not Susceptible}\}$. 


The probability that a subject is in each of these categories 
may be calculated as follows: 


$P(i \in A_{obs}) = P(Z_i = 1, R_i = 1, C_i = c_i) = p \pi (1 - \pi)^{c_i - 1}$ 
$P(i \in A_{mis}) = P(Z_i = 1, R_i = 0, C_i > 12) = p (1 - \pi)^{12}$ 
$P(i \in A_{imm}) = P(Z_i = 0) = 1 - p$. 

The data indicate $m = 1{,}429$ subjects in $A_{obs}$ and 
$n - m = 969$ non-responsive subjects in $A_{mis} \cup A_{imm}$; 
$n = 2{,}398$ is the estimated total number of subjects chosen to 
participate in the survey. Thus, the joint density of $Z_{obs}$, $R$ 
and $C_{obs}$, given $p$ and $\pi$ is: 


$f(Z_{obs}, R, C_{obs} \mid p, \pi) = p^{m} \pi^{m} (1 - \pi)^{\sum_{i=1}^{m} c_i - m} \left[ (1 - p) + p (1 - \pi)^{12} \right]^{n - m}$. (5) 


The mixture model described by Equation 5 may be 
viewed as a special case of the non-response models 
discussed in Drew and Fuller (1981). 
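
As a small check on the mixture structure, the sketch below evaluates the three class probabilities and confirms that they account for every subject. The values of p and π are illustrative only and are not estimates from the survey; the function name is hypothetical.

```python
def class_probabilities(p, pi, max_calls=12):
    """Return (responded_by_call, missing, immune) probabilities.

    responded_by_call[c - 1] = p * pi * (1 - pi)**(c - 1)  (in A_obs, C = c)
    missing                  = p * (1 - pi)**max_calls      (in A_mis)
    immune                   = 1 - p                        (in A_imm)
    """
    responded = [p * pi * (1.0 - pi) ** (c - 1)
                 for c in range(1, max_calls + 1)]
    missing = p * (1.0 - pi) ** max_calls
    immune = 1.0 - p
    return responded, missing, immune

# Illustrative values only.
responded, missing, immune = class_probabilities(p=0.6, pi=0.2)

total = sum(responded) + missing + immune
print(total)              # 1 up to floating-point rounding
print(missing + immune)   # probability that a subject is a non-respondent
```

The sum `missing + immune` is the bracketed term $(1 - p) + p(1 - \pi)^{12}$ that each of the $n - m$ non-respondents contributes to the joint density.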

It would be useful to confirm that the above joint 
distribution accurately represents the response pattern of the 
susceptibles in the dataset. The MLE estimate for p is 
simply the proportion of respondents in the sample, which 
clearly underestimates p. Setting U(0, 1) prior distributions 
for both $p$ and $\pi$ and examining their joint posterior 
distribution by MCMC simulation, the posterior medians are 
found to be $p = .636$ and $\pi = .205$, with equal-tailed posterior 
credible intervals of (.613, .659) and (.191, .219) for $p$ and 
$\pi$, respectively. Figure 3 illustrates how the dataset might 
look after imputing the missing number of calls for our 
susceptible non-respondents based on these posterior 
medians. The problem previously displayed in Figure 2 has 
now been mostly eliminated. 
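
A posterior simulation of this kind can be sketched with a random-walk Metropolis sampler on the log of the joint density of $Z_{obs}$, $R$ and $C_{obs}$ with flat priors on $p$ and $\pi$. This is a minimal sketch and not the authors' implementation. The counts n and m come from the text, but the total number of calls made to respondents is not reported here, so `TOTAL_CALLS` is a hypothetical value chosen only so that the sketch produces output in the neighborhood of the reported medians. The step size, starting point and chain length are also assumptions.

```python
import math
import random

N, M = 2398, 1429      # subjects sampled and respondents, from the text
MAX_CALLS = 12
TOTAL_CALLS = 5800     # HYPOTHETICAL sum of calls over the respondents

def log_post(p, pi):
    """Log joint density of (Z_obs, R, C_obs) plus flat U(0,1) log priors."""
    if not (0.0 < p < 1.0 and 0.0 < pi < 1.0):
        return -math.inf
    nonresp = (1.0 - p) + p * (1.0 - pi) ** MAX_CALLS
    return (M * math.log(p) + M * math.log(pi)
            + (TOTAL_CALLS - M) * math.log(1.0 - pi)
            + (N - M) * math.log(nonresp))

def metropolis(n_iter=20000, burn_in=2000, step=0.01, seed=1):
    """Random-walk Metropolis on (p, pi); returns the post-burn-in draws."""
    rng = random.Random(seed)
    p, pi = 0.6, 0.2                      # assumed starting point
    current = log_post(p, pi)
    draws = []
    for t in range(n_iter):
        p_new = p + rng.gauss(0.0, step)
        pi_new = pi + rng.gauss(0.0, step)
        proposed = log_post(p_new, pi_new)
        # Accept with probability min(1, exp(proposed - current)).
        if math.log(1.0 - rng.random()) < proposed - current:
            p, pi, current = p_new, pi_new, proposed
        if t >= burn_in:
            draws.append((p, pi))
    return draws

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

draws = metropolis()
p_median = median([d[0] for d in draws])
pi_median = median([d[1] for d in draws])
print(p_median, pi_median)
```

All post-burn-in draws are kept rather than thinned. With the real call counts in place of the hypothetical total, the same structure would apply, although the step size would likely need tuning.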

While the Geometric distribution appears sufficient 
(after accounting for susceptibility), a referee questions the 
use of the Geometric distribution as it does not make use of 
possibly useful covariates. As explained above, the covariates 
we think would be most useful for this purpose were 
not collected. One alternative for modeling the response 
mechanism of the susceptibles is to use a discretized 
Gamma distribution. In cases where more complexity is 
necessary, the v-Poisson (a two parameter Poisson which 
generalizes some well known discrete distributions, 
including the Geometric) of Shmueli, Minka, Kadane, Borle 
and Boatwright (2001) may also be considered. 


5.2 Relating Non-Response to the Dependent 
Variable — The NI Model 


Since the non-response of the susceptibles is described 
by the conditional Geometric distribution of the number of 
calls, the effect of the non-response of the susceptibles on 
the dependent variable may be considered by including the 
number of calls as an additional explanatory variable in the 
logistic regression likelihood. This will create two addi- 
tional parameters in the logistic regression portion of the 
model, which are the coefficients of the number of calls, 
$\beta_{call_j}$, in each of the linear equations $\eta_{ij}$ described in 
equation (2). 

Non-zero coefficients for the number of calls, then, 
would indicate that the dependent variable is not indepen- 
dent of the non-response mechanism, and, hence the non- 
response mechanism is non-ignorable. If these coefficients 
are zero, the non-response of the susceptibles is ignorable. 
Conclusions made here rely upon the underlying modeling 
assumption that the relationship among the number of calls, 
the dependent variable and the other explanatory variables 
considered is the same for the respondents and susceptible 
non-respondents. Including the number of calls in the 
logistic regression portion of the model does not address the 
immune subjects, since there will never be the realization of 
a successful call to them. 


[Figure 3: chart of successful calls by attempt; legend: Actual Data, Imputed Future Calls; horizontal axis: Number of Call Attempts] 

Figure 3. Display of the actual number of successful calls on each attempt through the first 12 and the 
expected number of successful calls for call attempts 13 and higher. Imputed values are based 
on a probability of a successful call of .205 and a probability of susceptibility of .636. 

The full pseudo-likelihood for the NI model (or, more 
precisely, the susceptible NI model) is the product of the 
non-response and logistic regression pieces, with $\eta_{ij}$ now 
including the term for the number of calls: 


$L(p, \pi, \beta) \propto p^{m} \pi^{m} (1 - \pi)^{\sum_{i=1}^{m} c_i - m} \left[ (1 - p) + p (1 - \pi)^{12} \right]^{n - m} \times \prod_{i=1}^{m} \prod_{j=0}^{2} \left[ \frac{e^{\eta_{ij}}}{1 + e^{\eta_{i1}} + e^{\eta_{i2}}} \right]^{Y_{ij} W_i}$. (6) 


Note that the household and post-stratification weighting 
variable $W_i$ is included here in an effort to account for 
whether proper stratification of the respondents may 
eliminate the necessity for the introduction of a mechanism 
to describe non-response. 


5.3 Data Augmentation 


Tanner and Wong (1987) suggest an iterative method for 
computation of posterior distributions when faced with 
missing data. This method applies whenever augmenting 
the dataset makes it easier to analyze and the augmented 
items are easily generated. Consider the following additional 
notation: Let $S$ represent the total number of 
susceptible subjects in the sample, 
$S = \sum_{i=1}^{n} Z_i$, $S \sim \text{Binomial}(n, p)$. Let $X$ be the matrix of 
explanatory variables (including the number of calls) for all the 
subjects selected to participate in the survey. Let 
$Y = (Y_1, \ldots, Y_n)$ be the vector of their responses. Partition 
$X$ into $\{X_{obs}, X_{mis}, X_{imm}\}$ and $Y$ into $\{Y_{obs}, Y_{mis}, Y_{imm}\}$. Also, 
by the memoryless property of the Geometric distribution, 
the distribution of the additional number of calls required to 
reach the subjects in $A_{mis}$ is known, and may be expressed: 
$\forall i \in A_{mis}$, let $V_i = C_i - 12$, which is also distributed as a 
Geometric random variable with parameter $\pi$. 

Now suppose that the true values of $S$, $X_{mis}$, and $Y_{mis}$ 
were known. The likelihood could then be considered in the 
form: 


$L(p, \pi, \beta \mid X_{obs}, X_{mis}, Y_{obs}, Y_{mis}, S, R) = \left[ \pi^{S} (1 - \pi)^{\sum_i C_i - S} \right] \times \left[ p^{S} (1 - p)^{n - S} \right] \times \prod_{i=1}^{S} \prod_{j=0}^{2} \left[ \frac{e^{\eta_{ij}}}{1 + e^{\eta_{i1}} + e^{\eta_{i2}}} \right]^{Y_{ij} W_i}$, (7) 


where $\sum_i C_i = \sum_{i \in A_{obs}} C_i + \sum_{i \in A_{mis}} (V_i + 12)$ is the number of calls 
that would have been necessary to reach all susceptibles and 
the summands are taken over the appropriate range of 
subjects. 

Although the true values of $S$, $X_{mis}$ and $Y_{mis}$ are 
unknown, one may utilize what is known about the 
behavior of these variables to impute stochastically possible 
values for them within the MCMC algorithm. Given $p$, a 
value for $S$ may be drawn from a truncated Binomial (2,398, 
$p$), where $1{,}429 \le S \le 2{,}398$. Given $S$, the number of 
subjects in $A_{mis}$ is known. For each of these subjects in $A_{mis}$, 
a value $V_i \sim \text{Geometric}(\pi)$ may be drawn, which results in 
an imputation for the number of calls needed to reach each 
susceptible but unreached subject. The relationships among 
the number of calls and the other explanatory variables may 
then be exploited to impute values for the rest of $X_{mis}$. 
Specifically, the missing values of Age and K-risk are 
imputed by regressing Age and K-risk, respectively, on Calls 
and predicting from the resultant linear equations. Similarly, 
the missing values of Smoker and Bother are imputed 
via logistic regression on each, using Calls as the explanatory 
variable. Here the model assumptions are checked 
using the respondents’ data, and an assumption is being 
made that these same relationships hold for the susceptible 
non-respondents. Note that these regression and logistic 
regression equations are fit in the Bayesian context (e.g., 
Gelman, Carlin, Stern and Rubin 1998) and necessitate the 
inclusion of additional parameters, B,, in the MCMC 
process which describe these relationships (see Appendix 
B for more detail). We chose this imputation plan in the 
interest of the efficiency of the full MCMC algorithm. An 
alternative would be to impute the missing values for a 
particular explanatory variable conditional on all the 
remaining variables (e.g., Rubin 1996). Finally, $Y_{mis}$ may 
be predicted by utilizing the imputed values of $X_{mis}$ and the 
relationship described in the logistic regression model. In 
the interest of the exchangeability of the susceptible 
non-respondents in the absence of subsequent stratification 
information, we apply a weight of 1.0 to all the imputed 
$Y_{mis}$ values; an alternative here would be to impute the sex 
and household size of the susceptible non-respondents, in 
addition to their age, and apply the weighting procedure 
described in Appendix A to the imputed $Y_{mis}$. 


5.4 Sampling from the Posterior Distribution 


The full MCMC simulation consists of a Metropolis 
algorithm supplemented in every iteration with the data 
augmentation described above. An outline of the MCMC 
algorithm used may be found in Appendix B. Convergence 
was assessed utilizing the method of Heidelberger and 
Welch (1983) as described in Cowles and Carlin (1996). 
MacEachern and Berliner (1994) assert that, under loose 
conditions, subsampling the MCMC simulated values to 
account for autocorrelation will result in poorer estimators. 
Following their suggestion, all simulated values, after an 
appropriate burn-in period, were used in the analysis that 
follows. 


138 Mariano and Kadane: The Effect of Intensity of Effort to Reach Survey Respondents 


6. CHOICE OF PRIOR DISTRIBUTIONS 


In the evaluation of possible prior distributions for the 
parameters of both the NI and MAR models, the goal of the 
comparison of the various models was taken into consi- 
deration. The choice of prior distributions for the para- 
meters was made from the perspective of the MAR belief. 
Two possibilities were examined. 

The first option is built around the utilization of the 
Phase I & II surveys. Since these surveys were similar to 
and were completed prior to the Phase III survey which 
comprises our data, information contained in these first two 
surveys may be utilized in the construction of priors. The 
same dependent variable was contained in the Phase I & II 
dataset, along with the variables Smoker, Age, and K-risk. 
A logistic regression model was compiled from the Phase 
I & II data to describe the relationship between the opinion 
on workplace smoking and these three explanatory vari- 
ables. Normal priors were constructed for the coefficients 
of these three variables centered at their MLE’s, but with 
increased standard error. The error terms were increased 
due to three factors: 


i) There was a three year span between the Phase II and 
Phase III surveys; opinions may have changed over that 
time, possibly as a result of the impact of the bylaw. 


ii) The MLE’s were calculated under the same MAR 
assumption being evaluated. 


iii) Prior to the collection of the Phase III data, there existed 
the possibility that other explanatory variables would be 
included in the model; in the presence of other variables, 
the effect of these three could be altered. 


Although the variances were increased, the means were not 
changed, since it was unknown, a priori, in what direction 
any change might occur. Since the available Phase I & II 
data contained no information about the Calls or Bother 
variables, the coefficients of these were assigned a diffuse 
Normal (0,9) prior. For clarity, this option will be referred 
to as the “Phase I & II prior” in this analysis. 

In the second option Normal (0,9) priors are assigned to 
each of the logistic regression coefficients. One motivation 
for this choice is that, for the same three reasons the error 
terms were increased above, the variables common to the 
Phase I & II and Phase III surveys are not exchangeable. 
Thus, construction based on the Phase I & II results would 
be inappropriate. This option will be referred to as the 
“Central prior”. 

The choice to use Normal (0,9) distributions here is for 
convenience. Centering the prior at zero gives equal weight 
to either direction of the relationship. We believe the choice 
of a variance of nine to be adequate without being overly 
diffuse. The use of improper priors could lead to a Markov 
Chain Monte Carlo simulation that never converges, and, as 
Natarajan and Kass (2000) show, an overly diffuse proper 
prior may behave like an improper one. In section 7.2, we 
offer a sensitivity analysis to evaluate how the results are 
affected by the choice of prior. 

The non-response parameters of the NI model, $p$ and $\pi$, 
were treated the same under both prior options. There was 
no additional information available about the probability of 
a successful call or the probability of susceptibility. Thus, 
$p$ and $\pi$ were each assigned a U(0,1) prior. 

The data augmentation parameters found in each of the 
logistic regression equations, B,, were independently given 
diffuse Normal (0,9) priors. For each linear regression 
equation found in the data augmentation process, the 
coefficients and the variance, $\sigma^2$, were jointly assigned a prior 
proportional to $1/\sigma^2$, the standard non-informative prior 
distribution (e.g., Gelman et al. 1998). Note that the closed 
forms of the posterior distributions of the linear regression 
parameters are known and may be drawn from directly. 
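
The Central prior for the NI model can be written as a log-density in a few lines. The sketch assumes that Normal (0,9) denotes mean zero and variance nine, as the text states, and that the NI model has 16 logistic regression coefficients (the 14 of the MAR model plus the two call coefficients). The function names are hypothetical.

```python
import math

def normal_logpdf(x, mean, variance):
    """Log density of a Normal distribution given its mean and VARIANCE."""
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (x - mean) ** 2 / (2.0 * variance))

def central_log_prior(coefs, p, pi):
    """Independent Normal(0, 9) priors on every logistic coefficient and
    U(0, 1) priors on p and pi (which add zero on the log scale)."""
    if not (0.0 < p < 1.0 and 0.0 < pi < 1.0):
        return -math.inf
    return sum(normal_logpdf(b, 0.0, 9.0) for b in coefs)

# Sixteen coefficients set to zero, the prior mode, purely for illustration.
coefs = [0.0] * 16
print(central_log_prior(coefs, p=0.5, pi=0.5))
```

A variance of nine corresponds to a standard deviation of three on the log-odds scale. The Phase I & II prior would replace the zero means and common variance for the Smoker, Age and K-risk coefficients with the values constructed from the earlier surveys, which are not reproduced in this section.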


7. RESULTS 


First, the validity of the MAR assumption is examined 
through the coefficients of the number of calls variable. 
Then, the NI model is evaluated with respect to sensitivity 
to the choice of prior. Finally, the magnitude of the impact 
of a faulty MAR assumption for this dataset is investigated 
by illustrating the change in the odds of response. 


7.1 Coefficients for the Number of Calls 


For both the Phase I & II and Central priors, Figure 4 
displays the posterior density (solid line) and 95% credible 
interval estimates (dotted lines) of the coefficient of the 
calls variable in $\eta_{i1}$ in the NI model, and compares them to 
the point $\beta_{call_1} = 0$ (dashed lines). The results clearly indicate 
this coefficient differs from zero. We also find a non-zero 
result in $\eta_{i2}$, where, using the Phase I & II prior, the 95% 
HPD credible interval for $\beta_{call_2}$ is (-0.03613, 0.11595). 

The non-zero coefficient of $C_i$ demonstrates a dependence 
between the number of calls and the subject’s 
opinion on smoking in the workplace. Thus, the dependent 
variable and the non-response mechanism are not independent 
under the conditions discussed in section 5.2. This 
result implies that an assumption that the missing observations 
are missing at random prior to accounting for the 
non-response mechanism is incorrect for this dataset. 

There is a hint in Figure 3 that the probability of a 
successful call decreases as the call number increases. To 
verify the assumption that the relationship between the 
number of calls and the log odds of response is linear, a 
second Bayesian NI model was constructed. This model 
split the calls variable into two, $C_i I(C_i < 7)$ and $C_i I(C_i \ge 7)$, 
based on whether the number of calls was fewer than 
seven. The posterior distributions of the coefficients of 
these two variables were then compared, and evidence that 
they are essentially different was not found. In particular, 
for $\eta_{i1}$ the 95% credible interval for $C_i I(C_i \ge 7)$ contained the 
same interval for $C_i I(C_i < 7)$, and for $\eta_{i2}$ the 95% credible 
intervals strongly overlapped. 
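
The construction of the two call variables can be sketched as follows. The threshold of seven comes from the text; the function names and the slope values are hypothetical.

```python
def split_calls(calls, threshold=7):
    """Return (C * I(C < threshold), C * I(C >= threshold))."""
    if calls < threshold:
        return calls, 0
    return 0, calls

def calls_contribution(calls, beta_low, beta_high, threshold=7):
    """Contribution of the two call terms to a linear predictor eta_ij."""
    low, high = split_calls(calls, threshold)
    return beta_low * low + beta_high * high

# Hypothetical slopes. With equal slopes the split model reduces to the
# single linear term in the number of calls.
print(calls_contribution(3, 0.04, 0.04), calls_contribution(10, 0.04, 0.04))
```

Exactly one of the two variables is non-zero for any subject, so comparing the posterior distributions of the two slopes is a direct check on whether one common slope is adequate.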
[Figure 4: panel title: Posterior Density: Coefficient of the Number of Calls (j = 1)] 

Figure 4. Display of $\beta_{call_1}$, the coefficient of the calls variable in $\eta_{i1}$: posterior density (solid line) and 95% 
equal tailed credible interval (dotted line), compared to $\beta_{call_1} = 0$ (dashed line). 


7.2 Sensitivity to Priors 


Would different prior distributions, either on the calls 
coefficient or on the others, make a difference in the effect 
illustrated above? Table 1 displays 95% HPD credible 
intervals for the coefficient of the calls variable in the first 
logit equation of the NI model for six different priors. The 
priors include the Phase I & II and Central priors as well as 
four others, labeled options 3, 4, 5, and 6. Options 3 and 
4 resemble the Central prior except that they change the 
prior distribution on the coefficient of the number of calls 
to Normal (1,9) and Normal (-1,9) respectively. Option 5 
places Normal (0,9) priors on $\beta_{call_j}$, $\beta_{\text{Age}_j}$, and $\beta_{\text{b.USUL}_j}$, a 
Normal (1,9) prior on $\beta_{0j}$, a Normal (.5,9) prior on $\beta_{\text{K-risk}_j}$, 
a Normal (-1,9) prior on $\beta_{S_j}$, and Normal (-.5,9) priors on 
$\beta_{SQ_j}$ and $\beta_{\text{b.NO}_j}$. Option 6 takes the Central prior and 
reduces all the variances from nine to two. 

Under all six priors, Table 1 demonstrates that the coefficient 
of the calls variable in the first logit equation clearly 
differs from zero. The finding that the missing data 
mechanism is non-ignorable for this dataset does not appear 
to be affected by the choice of prior among these options. 


Table 1 
95% HPD Credible Intervals for $\beta_{call_1}$ Under Six Different 
Prior Distributions 


Coefficient of the number of calls “$C_i$” in $\eta_{i1}$: 95% intervals 

Prior           Lower Bound    Upper Bound 
Phase I & II    0.00129        0.07746 
Central         0.00446        0.07980 
Option 3        0.00447        0.07983 
Option 4        0.00441        0.07975 
Option 5        0.00440        0.07970 
Option 6        0.00436        0.07944 


7.3 Effect on Odds of Response 


Given the failure of the MAR assumption shown above, 
it is of interest to question the relevance of the error that 
using the MAR assumption would create. The magnitude of 
the error induced by a faulty MAR assumption may be 
illustrated by examination of its effect on the odds ratio 
$p_1(x_i) / p_0(x_i)$. First, we consider the effect on a typical 
respondent profile. The modal respondent was a non-smoker 
between 25 and 35 years old who was 
usually bothered by second-hand smoke, had a K-risk of 11 
and could be reached in 2 calls. We label this modal 
respondent as Subject 1. Table 2 demonstrates the change 
in posterior odds for Subject 1 when called 13 times. 


Table 2 
Comparison of the Odds of Response for 4 Typical Subjects. 
Posterior Medians Were Used as the Point Estimates for 
the Coefficients in the Bayesian Models; the MLE Was Used 
for the Frequentist Model 


                                  Subject 1   Subject 2   Subject 3   Subject 4 
Smoker                            No          No          Former      Yes 
Age                               30          50          27          40 
Bother                            Usually     Always      No          No 
K-risk                            11          12          7           3 

Model                             Odds Y=1/Y=0 
MAR MLE                           0.674       2.105       0.457       0.396 
MAR Phase I & II prior            0.703       4.487       0.209       0.116 

NI Phase I & II prior: 2 calls    0.640       4.024       0.202       0.108 
NI Central prior: 2 calls         0.593       4.442       0.162       0.102 
Option 3: 2 calls                 0.594       4.449       0.162       0.102 
Option 4: 2 calls                 0.592       4.435       0.162       0.101 
Option 5: 2 calls                 0.590       4.423       0.161       0.101 
Option 6: 2 calls                 0.590       4.426       0.161       0.101 

NI Phase I & II prior: 13 calls   0.974       6.128       0.308       0.165 
NI Central prior: 13 calls        0.936       7.013       0.256       0.160 
Option 3: 13 calls                0.937       7.026       0.256       0.161 
Option 4: 13 calls                0.934       7.000       0.255       0.160 
Option 5: 13 calls                0.930       6.975       0.254       0.159 
Option 6: 13 calls                0.931       6.980       0.254       0.160 


The Subject 1 column of Table 2 indicates a dramatic 
difference in the posterior odds when the non-response 
mechanism is taken into consideration. For this typical 
respondent profile, when the number of calls is increased 
from two to thirteen the posterior odds of choosing 
“Smoking should not be permitted at all” over “Smoking 
should be permitted in restricted areas only” increases by 
52.18% under the Phase I & II prior and 57.84% when 
using the Central prior. This is dramatic evidence of the 
relationship between the dependent variable and the non- 
response mechanism. 
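
The percentage changes quoted for Subject 1 can be approximately recovered from the rounded entries of Table 2. The sketch below also back-calculates the per-call slope implied by those entries, under the assumption that the log odds are linear in the number of calls with all other covariates held fixed, so that the ratio of the odds at 13 and at 2 calls equals $e^{11 \beta_{call_1}}$. Because the table entries are rounded posterior medians, the results are approximate; the function name is hypothetical.

```python
import math

# Odds of Y=1 versus Y=0 for Subject 1, copied from Table 2:
# (odds at 2 calls, odds at 13 calls).
subject1_odds = {
    "Phase I & II": (0.640, 0.974),
    "Central": (0.593, 0.936),
}

def odds_change(odds_at_2, odds_at_13):
    """Return (percent increase in odds, implied per-call log-odds slope)."""
    ratio = odds_at_13 / odds_at_2
    percent = 100.0 * (ratio - 1.0)
    slope = math.log(ratio) / (13 - 2)
    return percent, slope

for prior, (odds_2, odds_13) in subject1_odds.items():
    percent, slope = odds_change(odds_2, odds_13)
    print(f"{prior}: +{percent:.1f}% in odds, implied slope {slope:.4f}")
```

The implied slopes, roughly 0.04 per call under either prior, fall inside the corresponding 95% HPD intervals of Table 1, which is consistent with the linear specification. Small differences from the percentages in the text reflect rounding in the table.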

Are the results for the modal subject above typical? 
Table 2 also displays the effects on the odds of response 
under the NI model for three additional test subject profiles 
for each of the six different priors considered above. 
Subject 2 is a fifty year old non-smoker who is always 
bothered by smoke and has a perfect “K-risk” score. 
Subject 3 is a 27 year old former smoker who is not 
bothered by smoke and has a “K-risk” score of seven. 
Subject 4 is a 40 year old smoker who is not bothered by 
smoke and has a “K-risk” score of three. On multiple 
subjects with multiple priors, Table 2 consistently shows 
the same result. Increasing the number of calls to greater 
than 12 will increase the posterior odds of choosing 
category “1” over category “0”. For each of the test subjects 
and priors found in Table 2, the increase was between 
52.18% and 58.41%. 

Similar results were found when examining the odds of 
choosing the “Smoking should not be restricted at all” 
category over the “Smoking should be permitted in 
restricted areas only” category. Using test subjects which 
were a current and a former smoker (Subjects 3 and 4 
above), the posterior odds increased 46.7% when the 
number of calls was increased from 2 to 13 under the Phase 
I & II prior. 


7.4 Effect on Probability of Response 


With the shift in posterior odds illustrated above comes 
a corresponding shift in the estimated probabilities that a 
subject will respond in a particular category. Among the 
respondents, 57.45% chose category “0”, 40.64% chose 
category “1”, and 1.91% chose category “2”. The number 
of non-respondent susceptibles has a posterior median of 
469, with a 95% credible interval of (25, 944). On average, 
55.88% of the simulated non-respondent susceptibles chose 
category “0”, 40.03% chose category “1”, and 4.08% chose 
category “2”. While, for categories “0” and “1”, the average 
values for the non-respondent susceptibles do fall within the 
95% confidence intervals for the proportions of the 
respondents in these categories, the point estimates for each 
category shift when the non-response mechanism is 
included in the model. In comparing the category “2” 
results, we estimate that non-respondents are twice as likely 
to favor no restrictions on smoking (category “2”) as are 
respondents. While the low number of subjects found in 
category “2” is unlikely to provoke a change in workplace 
smoking law, the increase noted in the non-respondents 
in this category serves as an example of how the lack of 
proper consideration of the non-respondents could lead to 
flawed conclusions about the data. 


8. CONCLUSION 


Section 7 demonstrates that, for the dependent variable 
of interest in this dataset, an assertion that the missing 
observations are missing at random, prior to accounting for 
the missing data mechanism, is incorrect, assuming the 
relationship among the relevant variables is the same for all 
susceptible subjects. Furthermore, the use of a faulty MAR 
assumption in the evaluation of this dependent variable 
risks serious error in the calculation of the posterior odds 
and in any conclusion drawn from them. In order to perform 
a proper evaluation of the opinion on smoking in the 
workplace in Toronto in early 1993 via the dependent 
variable of interest in this survey, it is necessary to account 
for the non-response mechanism in the model structure.

In this analysis, only one simple piece of information, the
number of calls, was utilized. A more complete treatment 
could have been made, had more information been 
available. Knowledge of the exact number of calls to the 
non-respondents, instead of a minimum, and the time of day 
of the calls could have enabled this analysis to be more 
precise. In addition, knowledge of the type of non-response, 
refusal or non-availability, and the number of times the non- 
respondents were actually contacted could have allowed for 
better classification of the non-respondents. Groves and 
Couper (1998) point out that statistical errors arising from 
non-availability and those arising from refusals are likely to 
differ. As they further comment, the evaluation of how
efforts to seek cooperation affect measurement error is an
important area of research.

The results illustrated above apply only to this one 
dependent variable assessing smoking in the workplace in 
this one dataset. Given the perception that smoking has 
become less socially acceptable over recent years, it would 
be reasonable to think that non-response error due to
questions about smoking may be more severe than for other
topics. A comparison of non-response bias including
various smoking related questions and others which do not 
concern smoking may be found in Biemer (2001); this 
comparison lends no credence to the idea that non-response 
error is unique to questions relating to smoking. 

Although the above results imply nothing about
the missing data mechanisms in other surveys, there is a 
clear demonstration here that blindly assuming that the 
respondents of a survey constitute a random subsample of 
the population for the variables of interest can be an unwise 
choice. Information, available at the time of data collection, 
can enable the evaluation of whether or not the mechanism 
which causes the non-response is ignorable. In light of this 
observation, then, it should be of interest to those who work 
with such data to make use of the available information 
pertaining to the non-response in the evaluation of that data 
and to make such information available to others who 
utilize the dataset. As a general matter, we believe that the 
collection and analysis of data on where and how 
respondents were found, as well as how difficult they were 
to find, is an important future direction for survey 
methodology and practice.

A. Post-stratification Weighting


$HHW_i$ is the household weight of subject $i$ as described
in Northrup (1993).


— Let $m$ = the number of respondents.


— Let $r$ = the cumulative number of adults in the
responding households.


— Let $h_i$ = the number of adults in subject $i$’s household.
Then $HHW_i = h_i m / r$.


Proportions in the sample falling into the following age 
groups were calculated for both male and female 
respondents: 18-24 years, 25-44 years, 45-64 years, and 
over 65 years old. These proportions were then compared 
to the age/sex distribution in Metropolitan Toronto. 


— Let $p_{1i}$ = the proportion of adult Metropolitan Toronto
residents falling into the same age/sex category as
subject $i$, as per the 1991 Census.


— Let $p_{2i}$ = the proportion of survey respondents with the
same age and sex categories as subject $i$.


— $W_i = HHW_i \cdot p_{1i}/p_{2i}$, where $W_i$ is the final
post-stratification weight used in the analysis.
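The weighting steps above can be sketched in a few lines of Python. This is an illustration only: the function name, the category labels and the toy inputs are hypothetical, the household weight is assumed to be $h_i m / r$, and the sample proportions are computed unweighted, which the text does not specify.

```python
from collections import Counter


def post_stratification_weights(household_adults, cells, census_props):
    """Return W_i = HHW_i * (census proportion / sample proportion).

    household_adults: h_i, the number of adults in respondent i's household
    cells: age/sex category label of respondent i (hypothetical labels)
    census_props: dict mapping each category label to its census proportion
    """
    m = len(household_adults)          # number of respondents
    r = sum(household_adults)          # adults in all responding households
    # Household weight, assumed here to be h_i * m / r.
    hhw = [h * m / r for h in household_adults]
    # Unweighted share of respondents in each age/sex cell (an assumption).
    counts = Counter(cells)
    sample_props = {c: n / m for c, n in counts.items()}
    return [w * census_props[c] / sample_props[c] for w, c in zip(hhw, cells)]


# Toy usage: three respondents in cell "a", one in cell "b".
weights = post_stratification_weights(
    [1, 1, 1, 1], ["a", "a", "a", "b"], {"a": 0.5, "b": 0.5}
)
```

In the toy data the over-represented cell "a" is weighted down and cell "b" is weighted up, so that each cell carries half of the total weight, as in the assumed census distribution.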


B. MCMC Implementation 


The full MCMC simulation for the NI model consists of 
a Metropolis algorithm supplemented with the data 
augmentation described in section 5.3. The following is an 
overview of the MCMC algorithm. Variables used below 
are defined in section 5. At each iteration $t$,


1. Draw $p_t$ from Beta$(s_{t-1} + 1,\ 2398 - s_{t-1} + 1)$.
2. Impute $s_t$ from Binomial$(p_t)$, subject to $s_t > 1{,}429$.
3. Impute $C_{mis}$: draw $(s_t - 1{,}429)$ values $v$ from
Geometric$(\pi_{t-1})$ and set $c = v + 12$.
4. Draw $\pi_t$ from Beta$(s_t + 1,\ \sum_{Sus} c_i - s_t + 1)$.
5. Impute values for the rest of $X_{mis}$ by utilizing the
relationships with the number of calls, as described in
section 5.3.
6. Update the additional parameters used in the data
augmentation of $X_{mis}$.


Update linear regression parameters, $\beta$ and $\sigma$, by
drawing directly from the closed form of their
posteriors.

Update logistic regression parameters, $\beta$, using a
Metropolis step on each.


7. Impute $Y_{mis}$: draw each $y_i$ from
Multinomial$(p_0(x_i), p_1(x_i), p_2(x_i))$.
8. Update each $\beta$, using a Metropolis step on the
conditional likelihood and a Normal jump function.
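A single Metropolis update with a Normal jump function, of the kind mentioned in steps 6 and 8, might look like the following sketch. The log-posterior passed in is a stand-in (a standard normal density in the usage lines), not the conditional likelihood of the NI model, and all function and parameter names are illustrative.

```python
import math
import random


def metropolis_step(beta, log_post, scale, rng):
    """One random-walk Metropolis update of a scalar parameter.

    beta: current value of the parameter
    log_post: function returning the log of the (unnormalised) target density
    scale: standard deviation of the Normal jump function
    rng: a random.Random instance
    """
    proposal = beta + rng.gauss(0.0, scale)      # Normal jump around beta
    log_ratio = log_post(proposal) - log_post(beta)
    # Accept with probability min(1, exp(log_ratio)).
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return beta


# Usage with a stand-in target: a standard normal log-density.
rng = random.Random(0)
value, draws = 0.0, []
for _ in range(20000):
    value = metropolis_step(value, lambda b: -0.5 * b * b, 2.4, rng)
    draws.append(value)
chain_mean = sum(draws) / len(draws)
chain_var = sum((d - chain_mean) ** 2 for d in draws) / len(draws)
```

Because the proposal is symmetric, only the ratio of target densities enters the acceptance rule; the jump scale governs the trade-off between acceptance rate and step size.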
Double Sampling


M.A. HIDIROGLOU


ABSTRACT 


The theory of double sampling is usually presented under the assumption that one of the samples is nested within the other. 
This type of sampling is called two-phase sampling. The first-phase sample provides auxiliary information (x) that is 
relatively inexpensive to obtain, whereas the second-phase sample contains the variables of interest. The first-phase data 
are used in various ways: (a) to stratify the second-phase sample; (b) to improve the estimate using a difference, ratio or 
regression estimator; or (c) to draw a sub-sample of non-respondent units. However, it is not necessary for one of the 
samples to be nested in the other or selected from the same frame. The case of non-nested double sampling is dealt with in 
passing in the classical works on sampling (Des Raj 1968, Cochran 1977). This method is now used in several national
statistical agencies.

This paper consolidates double sampling by presenting it in a unified manner. Several examples of surveys used at Statistics
Canada illustrate this unification.


KEY WORDS: Double sampling; Auxiliary data; Regression; Optimal.


1. INTRODUCTION 


The theory of double sampling is usually
presented under the assumption that one of the samples is 
nested within the other. This type of sampling is called 
two-phase sampling. The first-phase sample provides 
auxiliary information (x) that is relatively inexpensive to 
obtain, whereas the second-phase sample contains the 
variables of interest. The first-phase data are used in various 
ways: (a) to stratify the second-phase sample; (b) to 
improve the estimation by using a difference, ratio or 
regression estimator; or (c) to draw a sub-sample of 
non-respondent units. Two-phase sampling is a powerful 
and cost-effective technique with a long history. Neyman 
(1938) was the first to propose it. Rao (1973) studied double
sampling in the context of stratification and analytic studies. 
Cochran (1977) presented the basic results of two-phase 
sampling, including the simplest regression estimators for 
this type of sampling design. More recent work on the 
subject includes that of Breidt and Fuller (1993), who 
developed efficient estimation methods for three-phase 
sampling computations using auxiliary data. Chaudhuri and 
Roy (1994) focused on the optimal properties of simpler but 
well-known regression estimators of two-phase sampling. 
Hidiroglou and Särndal (1998) proposed estimators based
on calibration and regression for two-phase sampling to 
account for the availability of auxiliary data at both levels 
of the sampling design. 

Estimation for nested and non-nested double sampling 
has been treated separately in the survey literature. 
However, it is not necessary for one of the samples to be 
nested within the other, or even be selected from the same 
survey frame. This case will be termed non-nested double 
sampling. It has been briefly discussed in classical
books on sampling such as Des Raj (1968) and Cochran
(1977). This method is used in several statistical agencies.
For example, at Statistics Canada, the Canadian Survey of
Employment, Payrolls and Hours (SEPH) uses this
sampling procedure (Rancourt and Hidiroglou 1998). In
this survey, two independent samples are drawn from two
different frames, which nevertheless represent the same
universe. The auxiliary data (x), which include the number
of employees and the total amount of payrolls, are obtained
from a sample selected from a Canada Customs and 
Revenue Agency administrative data file. These same 
variables, together with the variables of interest (y), the 
number of hours worked by employees and summarised 
earnings, are collected from a sample drawn from the 
Statistics Canada Business Register. Another example 
described by Deville (1999) is the case of a household 
survey conducted at INSEE. 

A single estimator can represent the overall estimation 
process, and the only difference is with respect to variance 
estimation. This paper is structured as follows. Part 2 sets 
out the notation. Part 3 describes how the double sampling 
procedures can be obtained from a single estimator. In Part 
4, the estimated variance for the nested and non-nested 
calibration estimator is presented. Several practical 
examples are provided in Part 5. Finally, Part 6 contains a 
brief summary. 


2. NOTATION 


2.1 Nested Case 


The population is represented by U = {1,...,k, ..., N }. 
First, a probability sample $s_1$ ($s_1 \subset U$) is selected from
population $U$ using a sampling design with inclusion
probability $\pi_{1k} = P(k \in s_1)$ for the $k$-th sampled unit in $s_1$.
Given $s_1$, a second sample $s_2$ ($s_2 \subset s_1 \subset U$) is drawn from $s_1$
using a sample design with conditional inclusion probability
$\pi_{2k|s_1} = P(k \in s_2 \mid s_1)$ for the $k$-th sampled unit in $s_2$. Note
that the probabilities are conditional since it is assumed that $s_1$
is known. Figure 1 displays an example of nested sampling.

We assume that $\pi_{1k} > 0$ for all values $k \in U$ and that
$\pi_{2k|s_1} > 0$ for all values $k \in s_1$. The weight of a sampled unit
$k$ will be denoted by $w_{1k} = 1/\pi_{1k}$ for the first-phase sample
and $w_{2k} = 1/\pi_{2k|s_1}$ for the second-phase sample. The overall
sampling weight of a selected second-phase unit, $k \in s_2$,
will therefore be $w_k^{*} = w_{1k} w_{2k}$.
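As a small illustration of how the two sets of inclusion probabilities combine, the sketch below computes the overall weight (first-phase weight times conditional second-phase weight) for each second-phase unit and uses it in an expansion estimate of a total. The function name and the toy values are hypothetical.

```python
def two_phase_total(y, pi1, pi2_given_s1):
    """Expansion estimate of a total from second-phase units.

    y: observed values for the units in the second-phase sample
    pi1: first-phase inclusion probabilities of those units
    pi2_given_s1: their conditional second-phase inclusion probabilities
    """
    total = 0.0
    for yk, p1, p2 in zip(y, pi1, pi2_given_s1):
        w1 = 1.0 / p1            # first-phase weight
        w2 = 1.0 / p2            # conditional second-phase weight
        total += w1 * w2 * yk    # overall weight is the product w1 * w2
    return total


# Two second-phase units with overall weights 4 and 8.
estimate = two_phase_total([10.0, 20.0], [0.5, 0.25], [0.5, 0.5])
```

Here the estimate is 10 × 4 + 20 × 8 = 200.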


Figure 1. Nested Samples


Let $x$ denote the auxiliary data vector available with the
first-phase sample, and $x_k$ the value for unit $k$. We proceed
as in Hidiroglou and Särndal (1998), that is, we divide $x_k$
into two parts $x_{1k}$ and $x_{2k}$. The values of the data vector $x_{1k}$
are assumed to be known for the entire population $U$, while
the values of data vector $x_{2k}$ are only known for the
first-phase sample $s_1$.


2.2 Non-nested Case 


It is possible for the two samples to be drawn 
independently from the same frame or even from different 
(but equivalent) frames. Figures 2 and 3 provide examples 
of these non-nested cases. 


Figure 2. Two independent samples selected from
different sample frames


The non-nested case represented by Figure 3 is not
considered in this paper. This case can be complicated for
arbitrary sampling plans because it is necessary to compute
joint inclusion probabilities between the two samples $s_1$
and $s_2$. This computation is simpler when the two samples $s_1$
and $s_2$ have been selected using a simple sampling design
such as simple random sampling (with or without
replacement). It is then possible to use Tam’s (1984) results
to obtain the required joint selection probabilities for the
computation of the estimated variance for a given estimator
of the total $Y = \sum_U y_k$.


Figure 3. Two samples drawn independently from
the same sample frame


For the case that we will study, we assume that samples $s_1$
and $s_2$ are drawn independently from two different frames
$U_1 = \{1, \ldots, k, \ldots, N_1\}$ and $U_2 = \{1, \ldots, k, \ldots, N_2\}$ (see
Figure 2). The inclusion probabilities of a sampled unit $k$
are respectively $\pi_k^{(1)} = P(k \in s_1) > 0$ and $\pi_k^{(2)} =
P(k \in s_2) > 0$ for samples $s_1$ ($s_1 \subset U_1$) and $s_2$ ($s_2 \subset U_2$).
The weight of unit $k$ is $w_k^{(1)} = 1/\pi_k^{(1)}$ for the first sample $s_1$
and $w_k^{(2)} = 1/\pi_k^{(2)}$ for the second sample $s_2$. The superscripts
(1) and (2) are used to differentiate these
selection probabilities from those of the samples drawn in the nested
case. The sampling units may differ between the two
frames, but these frames represent the same coverage.
Examples of such sampling procedures were mentioned in
the introduction and more details are provided in the second
example given in section 5.3.

Let $x_k = (x_{1k}', x_{2k}')'$ be an auxiliary data vector. We
assume that $x_{1k}^{(1)}$ is known for all units belonging to frame $U_1$,
while $x_{2k}^{(1)}$ is only known for sample $s_1$. We collect $y_k^{(2)}$ and $x_k^{(2)}$
from sample $s_2$. The $x$ data collected for corresponding
units in samples $s_1$ and $s_2$ may differ. The degree of
difference between the data values will vary according to
the complexity of the sampling unit, and how much these
units differ in concept between the two sampling frames.
For “simpler” units the data reported for “similar” units in $s_1$
and $s_2$ should be equal or almost equal. Departures in the
data similarity for the same units in $s_1$ and $s_2$ would most
likely be due to different questionnaire wording or
to different respondents filling in the questionnaires.
Nevertheless, we assume that $X = \sum_{U_1} x_k^{(1)} = \sum_{U_2} x_k^{(2)}$ since
$U_1$ and $U_2$ have the same coverage.


3. OPTIMAL ESTIMATOR FOR NESTED AND 
NON-NESTED SAMPLES 


In both cases, nested and non-nested, the objective is to
estimate the population total $Y = \sum_U y_k$, where $y_k$ represents
the value of unit $k \in U$. An unbiased estimator of $Y$ is
$\hat{Y}_{HT} = \sum_{s_2} w_k^{*} y_k$, where $w_k^{*} = w_{1k} w_{2k}$ for the nested case
and $w_k^{*} = w_k^{(2)}$ for the non-nested case.

The sampling weight of a unit is modified by multiplying
it by the calibration factor obtained using the various levels
of the auxiliary data (universe, first-phase sample). The
product is called a “calibration weight”. Table 1
summarises the available data for the nested and non-nested
cases, corresponding to Figures 1 and 2.


Table 1
Data Available for the Population and Samples

Set of Elements    Nested Case                               Non-nested Case

Population         $x_{1k}$: known for $k \in U$             $x_{1k}^{(1)}$: known for $k \in U_1$

First sample       $x_k$: observed for $k \in s_1$           $x_k^{(1)}$: observed for $k \in s_1$

Second sample      $y_k, x_k$: observed for $k \in s_2$      $y_k^{(2)}, x_k^{(2)}$: observed for $k \in s_2$


The following regression estimator is used to estimate
the population total $Y$ for nested and non-nested samples:

$$\hat{Y}_{REG} = \hat{Y}_{HT} + (X_1 - \hat{X}_1)' B_1 + (\hat{X} - \tilde{X})' B \qquad (3.1)$$

The various totals corresponding to the auxiliary data x 
and y-variable of interest given in equation (3.1) are 
provided in Table 2. 


It is assumed that the variances and covariances of
$\hat{Y}_{HT}$, $\hat{X}_1$, $\hat{X}$ and $\tilde{X}$ are known or
estimable.

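To simplify the notation, we drop the superscripts for the
remainder of this section. The estimation of the parameters $B$
and $B_1$, as well as of their associated variance, reflects that
we have sampled differently for the nested and non-nested
cases. The estimators of $B$ and $B_1$ are obtained by
minimising the variance of $\hat{Y}_{REG}$. This variance is:

$$V(\hat{Y}_{REG}) = V(\hat{Y}_{HT}) + B_1' V(\hat{X}_1) B_1 + B' V(\hat{X} - \tilde{X}) B - 2\,Cov(\hat{Y}_{HT}, \hat{X}_1') B_1 + 2\,Cov(\hat{Y}_{HT}, (\hat{X} - \tilde{X})') B - 2 B_1' Cov(\hat{X}_1, (\hat{X} - \tilde{X})') B. \qquad (3.2)$$

Differentiating (3.2) with respect to $B$ and $B_1$, we obtain the
following two equations:

$$V(\hat{X} - \tilde{X}) B + Cov((\hat{X} - \tilde{X}), \hat{Y}_{HT}) - Cov((\hat{X} - \tilde{X}), \hat{X}_1') B_1 = 0 \qquad (3.3)$$

and

$$-Cov(\hat{X}_1, (\hat{X} - \tilde{X})') B - Cov(\hat{X}_1, \hat{Y}_{HT}) + V(\hat{X}_1) B_1 = 0. \qquad (3.4)$$

Solving the system of equations (3.3) and (3.4), we
obtain the required parameters $B$ and $B_1$. That is:

$$B = T^{-1} H \qquad (3.5)$$

where

$$T = V(\hat{X} - \tilde{X}) - (Cov(\hat{X}_1, (\hat{X} - \tilde{X})'))' \, V^{-1}(\hat{X}_1) \, Cov(\hat{X}_1, (\hat{X} - \tilde{X})')$$

and

$$H = -Cov((\hat{X} - \tilde{X}), \hat{Y}_{HT}) + (Cov(\hat{X}_1, (\hat{X} - \tilde{X})'))' \, V^{-1}(\hat{X}_1) \, Cov(\hat{X}_1, \hat{Y}_{HT}),$$

and

$$B_1 = T_1^{-1} H_1 \qquad (3.6)$$

where

$$T_1 = V(\hat{X}_1)$$

and

$$H_1 = Cov(\hat{X}_1, \hat{Y}_{HT}) + Cov(\hat{X}_1, (\hat{X} - \tilde{X})') B.$$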
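Table 2
Sums of the Auxiliary Data x and y for Nested and Non-nested Cases

Set of Elements    Nested Case                                                                   Non-nested Case

Population         $X_1 = \sum_U x_{1k}$                                                         $X_1 = \sum_{U_1} x_{1k}^{(1)}$

First sample       $\hat{X}_1 = \sum_{s_1} w_{1k} x_{1k}$; $\hat{X} = \sum_{s_1} w_{1k} x_k$     $\hat{X}_1 = \sum_{s_1} w_k^{(1)} x_{1k}^{(1)}$; $\hat{X} = \sum_{s_1} w_k^{(1)} x_k^{(1)}$

Second sample      $\hat{Y}_{HT} = \sum_{s_2} w_k^{*} y_k$; $\tilde{X} = \sum_{s_2} w_k^{*} x_k$   $\hat{Y}_{HT} = \sum_{s_2} w_k^{(2)} y_k^{(2)}$; $\tilde{X} = \sum_{s_2} w_k^{(2)} x_k^{(2)}$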
Result 1: An optimal regression estimator for the nested
and non-nested samples is:

$$\hat{Y}_{OPT} = \hat{Y}_{HT} + (X_1 - \hat{X}_1)' \hat{B}_{1,OPT} + (\hat{X} - \tilde{X})' \hat{B}_{OPT} \qquad (3.7)$$

where

$$\hat{B}_{OPT} = \hat{T}^{-1} \hat{H} \qquad (3.8)$$

and

$$\hat{B}_{1,OPT} = \hat{T}_1^{-1} \hat{H}_1. \qquad (3.9)$$

$\hat{T}$, $\hat{H}$, $\hat{T}_1$ and $\hat{H}_1$ are the estimated values of $T$, $H$, $T_1$
and $H_1$, and they are obtained using a framework leading to
the inference based on the sampling design. These values
are dependent on the sample selection scheme. The
population variance of $\hat{Y}_{OPT}$ and its associated estimated
variance depend on whether the samples are nested
or non-nested. Since the regression vectors are optimal, it
follows that the regression estimator $\hat{Y}_{OPT}$ is also optimal.
The optimal form has been discussed by Montanari (1987,
1998, and 2000) for the case of a single phase sampling
design.
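To make the minimisation behind the optimal coefficients concrete, the following sketch treats the simplest case in which each auxiliary vector has a single component, so that all variances and covariances are scalars. It solves the two first-order conditions for the pair of coefficients and evaluates the variance of the regression estimator. The function names and the numerical inputs are purely illustrative assumptions, not values from any survey.

```python
def optimal_coefficients(v1, vd, c1y, cdy, c1d):
    """Scalar solution of the two first-order conditions.

    v1:  variance of the first-sample estimate of the known total
    vd:  variance of the difference between the two estimates of x
    c1y: covariance of the first-sample estimate with the estimate of Y
    cdy: covariance of the difference with the estimate of Y
    c1d: covariance of the first-sample estimate with the difference
    """
    t = vd - c1d * c1d / v1          # scalar T
    h = -cdy + c1d * c1y / v1        # scalar H
    b = h / t                        # B = H / T
    h1 = c1y + c1d * b               # scalar H_1, with T_1 = v1
    b1 = h1 / v1                     # B_1 = H_1 / T_1
    return b1, b


def regression_variance(vy, v1, vd, c1y, cdy, c1d, b1, b):
    """Variance of the regression estimator for given coefficients."""
    return (vy + b1 * b1 * v1 + b * b * vd
            - 2 * c1y * b1 + 2 * cdy * b - 2 * b1 * c1d * b)


# Illustrative (made-up) second moments forming a valid covariance matrix.
moments = dict(v1=4.0, vd=2.0, c1y=3.0, cdy=-1.0, c1d=0.5)
b1_opt, b_opt = optimal_coefficients(**moments)
v_min = regression_variance(10.0, b1=b1_opt, b=b_opt, **moments)
```

Any perturbation of either coefficient away from the returned pair increases the variance, which is what optimality means here.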


3.1 The Case of Nested Double Sampling 


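The theory for this case is developed using a conditional
approach. Suppose that two parameters are given by $\theta_1$ and
$\theta_2$, and that they are estimated by $\hat{\theta}_1$ and $\hat{\theta}_2$ from sample $s_2$.
If we condition on the realised sample $s_1$, then the following
well-known results hold:


(i) The expectation of $\hat{\theta}$ is $E(\hat{\theta}) = E_1 E_2(\hat{\theta} \mid s_1)$, where $E_2$
denotes the expectation of $\hat{\theta}$ given $s_1$.


(ii) The variance of $\hat{\theta}$ is


$$V(\hat{\theta}) = E_1 V_2(\hat{\theta} \mid s_1) + V_1 E_2(\hat{\theta} \mid s_1). \qquad (3.10)$$


(iii) The covariance between $\hat{\theta}_1$ and $\hat{\theta}_2$ is:

$$Cov(\hat{\theta}_1, \hat{\theta}_2') = E_1 Cov_2((\hat{\theta}_1, \hat{\theta}_2') \mid s_1) + Cov_1(E_2(\hat{\theta}_1 \mid s_1), E_2(\hat{\theta}_2' \mid s_1)).$$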
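The variance decomposition in result (ii) can be checked exactly on a toy population by enumerating every possible pair of nested samples. The sketch below assumes simple random sampling without replacement at both phases and uses the expansion estimator of the total; the population values and sample sizes are made up for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def variance(values):
    values = list(values)
    mu = mean(values)
    return mean((v - mu) ** 2 for v in values)


def decomposition(y, n1, n2):
    """Return (total variance, E1 of V2, V1 of E2) for the expansion estimator."""
    N = len(y)
    weight = Fraction(N, n2)   # 1 / (pi_1k * pi_2k|s1) = (N/n1) * (n1/n2)
    all_estimates, cond_means, cond_vars = [], [], []
    for s1 in combinations(range(N), n1):          # every first-phase sample
        est = [weight * sum(y[k] for k in s2)      # every second-phase sample
               for s2 in combinations(s1, n2)]
        all_estimates.extend(est)
        cond_means.append(mean(est))               # E2 given s1
        cond_vars.append(variance(est))            # V2 given s1
    # All (s1, s2) pairs are equally likely, so plain averages are exact.
    return variance(all_estimates), mean(cond_vars), variance(cond_means)


total_var, e1_v2, v1_e2 = decomposition([1, 2, 4, 7], n1=3, n2=2)
```

Exact rational arithmetic is used so that the two sides of the identity can be compared without rounding error.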


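The various components of $\hat{T}$, $\hat{H}$, $\hat{T}_1$ and $\hat{H}_1$ will be
estimated assuming an arbitrary sampling design with a
non-fixed sample size. The case of a fixed size sampling
design follows easily as it is a special case of the arbitrary
sampling design. Using expressions (i) to (iii), we can
re-express the terms defining parameter $B$ as:

$$V(\hat{X} - \tilde{X}) = E_1 \Big[ \sum\sum_{s_1} C_{2k\ell|s_1} \, x_k x_\ell' \Big]$$

and

$$Cov(\tilde{X}, \hat{Y}_{HT}) - Cov(\hat{X}, \hat{Y}_{HT}) = E_1 \Big[ \sum\sum_{s_1} C_{2k\ell|s_1} \, x_k y_\ell \Big] \qquad (3.11)$$

where $C_{2k\ell|s_1} = (\pi_{2k\ell|s_1} - \pi_{2k|s_1} \pi_{2\ell|s_1}) / (\pi_k^{*} \pi_\ell^{*})$.
The inclusion probabilities in these expressions
are $\pi_{2k\ell|s_1} = \Pr(k, \ell \in s_2 \mid s_1)$ and $\pi_k^{*} = \pi_{1k} \pi_{2k|s_1}$.
We can express $B$ more simply as:

$$B = \Big[ E_1 \Big( \sum\sum_{s_1} C_{2k\ell|s_1} \, x_k x_\ell' \Big) \Big]^{-1} \Big[ E_1 \sum\sum_{s_1} C_{2k\ell|s_1} \, x_k y_\ell \Big]$$

and the corresponding optimal estimator is given by:

$$\hat{B}_{OPT} = \Big[ \sum\sum_{s_2} \tilde{C}_{2k\ell|s_1} \, x_k x_\ell' \Big]^{-1} \Big[ \sum\sum_{s_2} \tilde{C}_{2k\ell|s_1} \, x_k y_\ell \Big] \qquad (3.12)$$

where $\tilde{C}_{2k\ell|s_1} = C_{2k\ell|s_1} / \pi_{2k\ell|s_1}$.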


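The optimal regression estimator $\hat{B}_{1,OPT}$ is given by
(3.9) with

$$\hat{T}_1 = \hat{V}(\hat{X}_1) \qquad (3.13)$$

and

$$\hat{H}_1 = \widehat{Cov}(\hat{X}_1, \hat{Y}_{HT}) + \widehat{Cov}(\hat{X}_1, (\hat{X} - \tilde{X})') \, \hat{B}_{OPT}.$$

Each component defining $\hat{T}_1$ and $\hat{H}_1$ is estimated as follows.
We first estimate $V(\hat{X}_1) = \sum\sum_U C_{1k\ell} \, x_{1k} x_{1\ell}'$ by

$$\hat{V}(\hat{X}_1) = \sum\sum_{s_1} \tilde{C}_{1k\ell} \, x_{1k} x_{1\ell}' \qquad (3.14)$$

where $C_{1k\ell} = (\pi_{1k\ell} - \pi_{1k} \pi_{1\ell}) / (\pi_{1k} \pi_{1\ell})$ and $\tilde{C}_{1k\ell} = C_{1k\ell} / \pi_{1k\ell}$.
Next, since

$$Cov(\hat{X}_1, \hat{Y}_{HT}) = \sum\sum_U C_{1k\ell} \, x_{1k} y_\ell \qquad (3.15)$$

we estimate $Cov(\hat{X}_1, \hat{Y}_{HT})$ by

$$\widehat{Cov}(\hat{X}_1, \hat{Y}_{HT}) = \sum\sum_{s_2} C_{1k\ell}^{*} \, x_{1k} y_\ell \qquad (3.16)$$

where $C_{1k\ell}^{*} = C_{1k\ell} / \pi_{k\ell}^{*}$, $\pi_{k\ell}^{*} = \pi_{1k\ell} \pi_{2k\ell|s_1}$,
$\pi_{1k\ell} = \Pr(k, \ell \in s_1)$, $\pi_{2k\ell|s_1} = \Pr(k, \ell \in s_2 \mid s_1)$
and $\pi_k^{*} = \pi_{1k} \pi_{2k|s_1}$.
Similarly,

$$\widehat{Cov}(\hat{X}_1, \hat{X}') = \sum\sum_{s_1} \tilde{C}_{1k\ell} \, x_{1k} x_\ell' \qquad (3.17)$$

and

$$\widehat{Cov}(\hat{X}_1, \tilde{X}') = \sum\sum_{s_2} C_{1k\ell}^{*} \, x_{1k} x_\ell'. \qquad (3.18)$$

Hence, in the case of nested double sampling the optimal
estimator of $B_1$ is given by:

$$\hat{B}_{1,OPT} = [\hat{V}(\hat{X}_1)]^{-1} \big[ \widehat{Cov}(\hat{X}_1, \hat{Y}_{HT}) + (\widehat{Cov}(\hat{X}_1, \hat{X}') - \widehat{Cov}(\hat{X}_1, \tilde{X}')) \hat{B}_{OPT} \big] \qquad (3.19)$$

where the components of $\hat{B}_{1,OPT}$ have been defined by
expressions (3.14) to (3.18).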

The optimal form of estimators $\hat{B}_{1,OPT}$ and $\hat{B}_{OPT}$ has its
advantages and disadvantages. One of the biggest
advantages of the optimal form, as reported by Cassady and
Valliant (1993), Rao (1994), and Montanari (2000), is that
it has good conditional inference properties (by
conditioning on the auxiliary variable x). As Montanari (2000)
observed, the asymptotic optimality of $\hat{Y}_{OPT}$ is strictly a
property based on the sampling design and achieved
conditionally on the finite population. The biggest
disadvantage of the optimal estimator is that it requires the
computation of joint inclusion probabilities.

We can, however, use the optimal form, and express it
more simply for several sampling designs. For sampling
designs where the sample selection is with unequal
probability and without replacement, we can bypass the
computation of the joint probability by approximating the
exact variance. Several authors, including Hartley and Rao
(1962), Deville (1999), Berger (1998), Rosén (2000) and
Brewer (2000) proposed such approximating procedures.


Recently, Tillé (2001) proposed the following
approximation for the estimated variance of $\hat{Y}_{HT} = \sum_s y_k / \pi_k$
in the context of single-phase sampling:

$$\hat{V}(\hat{Y}_{HT}) = \sum_s \frac{c_k}{\pi_k^2} (y_k - \hat{y}_k^{*})^2 = \sum_s c_k \Big( \check{y}_k - \frac{\sum_s c_\ell \check{y}_\ell}{\sum_s c_\ell} \Big)^2. \qquad (3.20)$$


Here, $c_k$ is the variable used as the approximation,
$\hat{y}_k^{*} = \pi_k \sum_s c_\ell \check{y}_\ell / \sum_s c_\ell$, $\check{y}_k = y_k / \pi_k$, and $\pi_k$ is the
probability of selection of a given unit $k$. Tillé (2001)
provided several examples of the $c_k$ values for various
sampling schemes.
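The following sketch illustrates the single-sum approximation on the simplest possible design, a simple random sample drawn without replacement, and compares it with the textbook variance estimator for that design. It assumes that the approximation has the centred weighted-sum-of-squares form read from the garbled source, with $c_k = n/(n-1) \cdot (1 - n/N)$ for this design; the function names and data are hypothetical.

```python
def single_sum_variance(y, pi, c):
    """Single-sum variance approximation (form assumed as described above)."""
    y_check = [yk / pk for yk, pk in zip(y, pi)]        # expanded values y_k / pi_k
    centre = sum(ck * v for ck, v in zip(c, y_check)) / sum(c)
    return sum(ck * (v - centre) ** 2 for ck, v in zip(c, y_check))


def srs_variance(y, N):
    """Usual variance estimator of N * ybar under SRS without replacement."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return N ** 2 * (1 - n / N) * s2 / n


# Toy sample of n = 3 units from a population of N = 12.
y_sample, N = [3.0, 5.0, 10.0], 12
n = len(y_sample)
pi = [n / N] * n
c = [n / (n - 1) * (1 - n / N)] * n
approx = single_sum_variance(y_sample, pi, c)
exact = srs_variance(y_sample, N)
```

For this design the two values coincide, and no joint inclusion probability is needed to compute the first one.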

This formula is exact in the case of a stratified simple
random sampling design drawn without replacement in each stratum
$U_h$ ($h = 1, \ldots, L$) of population $U$. Let $k$ be a sampled unit
in sample $s_h$ from stratum $U_h$; then $c_\ell =
n_h/(n_h - 1) \, (1 - n_h/N_h)$ if $\ell \in U_h$ and 0 otherwise, and
$\pi_\ell = n_h/N_h$ if $\ell \in U_h$ and 0 otherwise. This gives us the
exact estimated variance, $\hat{V} = \sum_{h=1}^{L} N_h^2 (1 - n_h/N_h)
\sum_{s_h} (y_k - \bar{y}_h)^2 / [n_h (n_h - 1)]$. The formula is also exact in the case
of a stratified sampling design where the sample is selected
with replacement. Here $c_\ell = 1$ for all units belonging to
stratum $U_h$ and zero otherwise. Using this approximation,
the double sums appearing in $\hat{B}_{OPT}$ and $\hat{B}_{1,OPT}$ can be
expressed as simple sums. Hidiroglou and Särndal (1998)
bypassed the problem of double sums in estimating $B$ and $B_1$
by proposing the GREG estimator, $\hat{Y}_{GREG}$, for a nested
two-phase sampling design. Their estimator is given by:


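$$\hat{Y}_{GREG} = \hat{Y}_{HT} + (X_1 - \hat{X}_1)' \hat{B}_{1,GREG} + (\hat{X} - \tilde{X})' \hat{B}_{GREG}$$

where

$$\hat{B}_{GREG} = \Big( \sum_{s_2} \frac{w_{1k} w_{2k} \, x_k x_k'}{\sigma_{2k}^2} \Big)^{-1} \sum_{s_2} \frac{w_{1k} w_{2k} \, x_k y_k}{\sigma_{2k}^2} \qquad (3.21)$$

and

$$\hat{B}_{1,GREG} = \Big( \sum_{s_1} \frac{w_{1k} \, x_{1k} x_{1k}'}{\sigma_{1k}^2} \Big)^{-1} \Big[ \sum_{s_2} \frac{w_k^{*} \, x_{1k} y_k}{\sigma_{1k}^2} + \Big( \sum_{s_1} \frac{w_{1k} \, x_{1k} x_k'}{\sigma_{1k}^2} - \sum_{s_2} \frac{w_k^{*} \, x_{1k} x_k'}{\sigma_{1k}^2} \Big) \hat{B}_{GREG} \Big] \qquad (3.22)$$

with $\{\sigma_{1k}^2 : k \in s_1\}$ and $\{\sigma_{2k}^2 : k \in s_2\}$ being predetermined
positive factors.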
Estimators $\hat{B}_{GREG}$ and $\hat{B}_{1,GREG}$ can be justified either by
assuming different regression models for each phase or by
using two successive calibrations. For the calibration
approach, calibration weights $\tilde{w}_{1k}$ associated with the first
phase are first computed, and they satisfy the calibration
equation $\sum_{s_1} \tilde{w}_{1k} x_{1k} = \sum_U x_{1k}$. These calibration weights
can be expressed as the product of sample weights $w_{1k}$ and
a calibration factor $g_{1k}$, where:

$$g_{1k} = 1 + \Big( \sum_U x_{1k} - \sum_{s_1} w_{1k} x_{1k} \Big)' \Big( \sum_{s_1} \frac{w_{1k} \, x_{1k} x_{1k}'}{\sigma_{1k}^2} \Big)^{-1} \frac{x_{1k}}{\sigma_{1k}^2} \qquad (3.23)$$

for $k \in s_1$.
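A first-phase calibration factor of this regression type can be illustrated with a single auxiliary variable. The sketch below is a minimal scalar version under that assumption: it computes the factors from the sampling weights, the auxiliary values, the positive scaling factors and the known population total, and the usage lines confirm that the calibrated weights reproduce that total. Names and data are hypothetical.

```python
def calibration_factors(x, w, sigma2, x_total):
    """Scalar regression-type calibration factors g_k for first-phase units.

    x: auxiliary values for the sampled units
    w: sampling weights of the sampled units
    sigma2: predetermined positive factors, one per unit
    x_total: known population total of the auxiliary variable
    """
    x_hat = sum(wk * xk for wk, xk in zip(w, x))                  # weighted sample total
    denom = sum(wk * xk * xk / s for wk, xk, s in zip(w, x, sigma2))
    return [1.0 + (x_total - x_hat) * (xk / s) / denom
            for xk, s in zip(x, sigma2)]


# Toy first-phase sample: the weighted total is 12 but the known total is 15.
x = [1.0, 2.0, 3.0]
w = [2.0, 2.0, 2.0]
g = calibration_factors(x, w, [1.0, 1.0, 1.0], x_total=15.0)
calibrated_total = sum(wk * gk * xk for wk, gk, xk in zip(w, g, x))
```

The factors are all close to one when the weighted sample total is already close to the known total, so calibration then changes the weights only slightly.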

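The first-phase calibration weights $\tilde{w}_{1k}$ are then used as
initial weights to compute the overall calibration weights
$\tilde{w}_k^{*}$. These overall calibration weights satisfy the second-phase
calibration equation $\sum_{s_2} \tilde{w}_k^{*} x_k = \sum_{s_1} \tilde{w}_{1k} x_k$. The
estimator of the total, $\hat{Y}_{GREG}$, can be expressed as the sum
of the product of the overall calibration weight $\tilde{w}_k^{*}$ and the
associated y-value, that is $\hat{Y}_{GREG} = \sum_{s_2} \tilde{w}_k^{*} y_k$. The calibrated
overall weights can be expressed as $\tilde{w}_k^{*} = w_k^{*} g_k^{*}$, where
$g_k^{*} = g_{1k} g_{2k}$. Here, $g_{1k}$ is given by (3.23), while $g_{2k}$ is
equal to

$$g_{2k} = 1 + \Big( \sum_{s_1} \tilde{w}_{1k} x_k - \sum_{s_2} \tilde{w}_{1k} w_{2k} x_k \Big)' \Big( \sum_{s_2} \frac{\tilde{w}_{1k} w_{2k} \, x_k x_k'}{\sigma_{2k}^2} \Big)^{-1} \frac{x_k}{\sigma_{2k}^2} \qquad (3.24)$$

for $k \in s_2$.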

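Comment: The estimators $\hat{B}_{GREG}$ (3.21) and $\hat{B}_{1,GREG}$
(3.22) correspond to Hidiroglou and Särndal’s (1998)
additive case and have the same form as the optimal
regression estimators $\hat{B}_{OPT}$ (3.8) and $\hat{B}_{1,OPT}$ (3.9). Indeed,
the components of the estimator of $B$ are obtained by
respectively estimating $T$ by $\sum_{s_2} w_k^{*} x_k x_k' / \sigma_{2k}^2$ and $H$
by $\sum_{s_2} w_k^{*} x_k y_k / \sigma_{2k}^2$. The second terms of $H$ and $T$ are
exactly equal to zero. Similarly, to estimate $B_1$, the
component $T_1$ is estimated by $\sum_{s_1} w_{1k} x_{1k} x_{1k}' / \sigma_{1k}^2$, while
$H_1$ is estimated by

$$\sum_{s_2} \frac{w_k^{*} \, x_{1k} y_k}{\sigma_{1k}^2} + \Big( \sum_{s_1} \frac{w_{1k} \, x_{1k} x_k'}{\sigma_{1k}^2} - \sum_{s_2} \frac{w_k^{*} \, x_{1k} x_k'}{\sigma_{1k}^2} \Big) \hat{B}_{GREG}.$$

The estimated variance of $\hat{Y}_{GREG} = \hat{Y}_{HT} +
(X_1 - \hat{X}_1)' \hat{B}_{1,GREG} + (\hat{X} - \tilde{X})' \hat{B}_{GREG}$ is presented in
Hidiroglou and Särndal (1998).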

Comment: The efficiency of the GREG, as stated in
Särndal, Swensson and Wretman (1992), requires that the
proposed model be correct. Furthermore, if the sample size
is large enough, optimal estimators are more efficient (Rao
1994) than the GREG. However, if the sample size is
relatively small, one disadvantage of the optimal form
is that it is generally less stable and more complex to
compute than the GREG. Furthermore, as reported
by Särndal (1996), and illustrated by simulation by
Montanari (2000), if the sample size is relatively
small, then the optimal form is not significantly more
efficient than the GREG. It is even possible for its
estimated variance to be greater than that associated with
the GREG.


3.2 The Case of Non-nested Double Sampling 


Deville (1999) considered the non-nested case (Figure 2)
by assuming that $x_{1k}$ is known for $s_1$ and $s_2$. The optimal
regression estimator is:

$$\hat{Y}_{1,OPT} = \hat{Y}_{HT} + (\hat{X}_1 - \tilde{X}_1)' \hat{B}_{1,OPT} \qquad (3.25)$$

where $\hat{Y}_{HT} = \sum_{s_2} w_k^{(2)} y_k$, $\hat{X}_1 = \sum_{s_1} w_k^{(1)} x_{1k}^{(1)}$ and $\tilde{X}_1 = \sum_{s_2} w_k^{(2)} x_{1k}^{(2)}$.
The optimal value of $B_1$ is
$B_{1,OPT} = (V(\hat{X}_1) + V(\tilde{X}_1))^{-1} \, Cov(\tilde{X}_1, \hat{Y}_{HT})$ if the two
sampling frames $U_1$ and $U_2$ are independent. The form of
the variance and of the covariance terms defining $B_{1,OPT}$
depends on the sampling design of $s_1$ and $s_2$.

The accuracy of the estimator of X, can be improved by 
minimising the variance of X,=A,X,+(I- ING 
yielding, A =(V(X,) + V(X,)) V(,). Sint: that 
V(X,) iS approximately a multiple of V(X), that is 
V(X,) = 0., V(X, ), we obtain A, = //(1 + a@,) where J is 
the identity matrix has the same dimension as the 
covariance matrix V(X, ). The optimal value of a, is 
obtained by minimising the variance of X,. A sub- optimal 
but adequate choice, suggested by Deville (1999), for a., is 
a, =n,/(n, +n,), where n, and n, are the respective sizes 
of samples $s_1$ and $s_2$. Note that Korn and Graubard (1999)
also made the same suggestion in the context of combining 
two totals estimated from two different sources. 
Substituting X, in place of xX, in expression (3.25), yields 


2, OPT 


Keak hy gl eGlanio (3.26) 
The estimator of the population total Y, is: 
Nock 3 Veen i (Xx, a X,) B, opr Ce) 


(x,-X%,)'). @.28) 


If (3.26) is substituted in (3.28), we can re-express 
2, opr 2s: 


Fea AB. | an 9 @ ie. Mpa ee) 


Survey Methodology, December 2001 


We see that $\hat{Y}_{1,OPT}$ (3.25) is exactly equal to
$\hat{Y}_{2,OPT}$ (3.27). This implies that there was no advantage in
using a better estimator of $X_1$ to estimate $Y$. However, the
estimator $\hat{B}_{2,OPT}$ associated with $\hat{Y}_{2,OPT}$ looks more like a
traditional regression estimator than the regression
estimator $\hat{B}_{1,OPT}$ associated with $\hat{Y}_{1,OPT}$.
Note that the GREG estimator for the case where the combined estimator of $X_1$ is
used instead of $\hat{X}_1$ is:


Yorso = Yur * (Xo - %2)’ Bo, crus (3.30) 
where 
i Q)\ery (2), (2) /_2 
B) crec -(., WoX x ee ke Age atk Vk ih O24. 


Furthermore, if we also know for keU, where 
Xe x a we can consider the cis estimator 


Mone erglonst | Born? AX = XxX) B pp 831) 
We obtain x by minimising the linear combination 


AX+ d- -A)X _and V(X) = a V(X). The difference 
between X and X can be re-expressed as 


X-X=(k -X)/G~0). 


Given that s, and s, are independent samples, it can be 
shown that: 


(6:32) 


Boe Vix) covey, Fn) Co) 
and that 
Bien Ay Xe) [Cova aye) |uren (3-34) 
The ee of B opr. are estimated by: 
y= Ye, Sete me (3:35) 
and 
COv/ 2 Vay) hen (3.36) 


whereas the components of B are estimated by: 


1, OPT 
V(X,) ae De Cee. 3.37) 
and 
Gey (ie, ae Oe (3.38) 
where 
Mp9 ~ MyM 


Ce : 
2ke 
(7,4) (Mo, ) 
Approximation (3.20) can also be used to estimate the
terms (3.35) to (3.38). The corresponding GREG which
bypasses the computation of joint selection probabilities is
given by:


“Yarn *(¥1-¥1) By creo *(¥ -4)’ Borsa 3.39) 
es KX, =Ly eX = a WX kes ew ke and 
-D,Wapt a | 


Gites regression estimators in equation (3.39) are 
estimated by 


(2)_.2(2),.)-! (2). (2) 
~ 1k * 1k Xi Ve 
B, GREG de, 2k Des 2k 2 (3.40) 
Oi, Oix 
and 
(2) _4(2) (2). (2) 
~ XX ry 
4s k wk k 7k 3.41 
Bore = De VEw ws fi) aseremorma Ggp 
On Or 


4. ESTIMATOR OF THE VARIANCE FOR THE 
OPTIMAL REGRESSION ESTIMATOR 


4.1 Nested Double Sampling 


Recall that the optimal regression estimator of $Y$ is given
by

$$\hat{Y}_{OPT} = \hat{Y}_{HT} + (X_1 - \hat{X}_1)' \hat{B}_{1,OPT} + (\hat{X} - \tilde{X})' \hat{B}_{OPT}. \qquad (4.1)$$

To obtain the estimated variance of (4.1), we re-express
the terms associated with the y-variable within $\hat{B}_{OPT}$ and
$\hat{B}_{1,OPT}$ as simple sums instead of double sums. Montanari
(1998) described this algebra for an arbitrary single-phase
sampling design. Following Montanari (1998), and
adapting the single-phase algebra to double sampling, we
obtain:


B pr=[d pee ae Xe | LL ds Ciao, | 


* 


a 
bz 3a Cage), | | De. a 
1 


(4.2) 


where 


1 Myx) s, x (Taxes, ~ MK) 5, 79), 
i et : Beene eee Xe 
Ty ib Toxo) s, %e 


We approximate silt pale given by (3.15) by 
[ V(X,)]"' [Cov (X,, Pr) J, and hence, 
ps Des Ce Fatt | 


Fi, 
ea ee, (4.3) 
where 
eid Se (Axe is Te Te) 


Ty 0+k Ty MiKo 
Ges) 


10° 


By substituting (4.2) and (4.3) in (4.1), and by 
subtracting the population total Y, we get: 


“-t-| ay bu ~~ dey | 


where 
a ’ A a -] 
g,-1+(X,-%,)(W(k,)) "a, for kes, (4.5) 
and 
pyeulas (ae x)’ (V(x)) 1a, for kes, (4.6) 
Result 2: The estimated variance of an defined by 
equation (4.1) is: 
Vi Yoon) Ds oe Cie Big Biv e 1k “10 
so Bpblo & Coxe Box Soe Cre 0 (4.7) 
where 
Fike week 
Cike = (Mee = Finis) u). 
Meg M1, My 
* (Taxes, ss Mk), M/s, ) 
2 Bea ae 
Mpa |s, HK ™ 
C1, = Ye ~ XB) opr’ 
and 
C4 = Ye ~ X,Bopy- 
4.2 Non-nested Double Sampling 
We obtain the estimated variance of Vee by using the 


following approximation. 
(4.8) 


Vee Vea +(x, -X,) By opr + (X -X) Boor 


Decomposing hes 
we have that: 


(4.9) 


pr into more elementary components, 


$ oO 
i ( XB, opp —— 41 


The variance of Voor as: 


Viton) vf Ley 


+ 20(Bopr VIX) Bi op +Cov(X, X))B orr} (4.11) 


Result 3: The estimated variance of Y,,,, v( yen 
defined by equation (4.8) is approximately equal to: 


+2a( Boor VX)B, opt COv(X,,X") Boor) (4-12) 


Computation of the first term of (4.12) is based on the 
residuals y/ (Ga Boon +X, Dopp) / Mirae The 
computation of the other terms of (4.12) is mainly based on 
the estimated variances of X , and of X, as well as on their 
estimated covariances. We can use the approximation of the 
variance, as described by Tillé (2001), and suitably adapt it 
to estimate the required covariances. 
5. SOME SPECIFIC EXAMPLES 


Three traditional examples for double sampling are 
presented for the two cases (nested and non-nested). 
Furthermore, we briefly describe how two major business 
surveys carried out by Statistics Canada use double 
sampling. 


5.1 Nested Sampling 


Example 1: Let us assume that a simple random sample $s_1$
of size $n_1$ is selected from a population U of size N. The
sample is stratified into L strata $s_{1h}$, each of size $n_{1h}$.
Random samples $s_{2h}$ of size $n_{2h}$ are then selected without
replacement in each stratum $s_{1h}$. The estimator of the total
is $\hat{Y}_{EXP} = N \sum_{h=1}^{L} p_{1h} \bar{y}_{2h} = N \bar{y}_{2,st}$, where $p_{1h} = n_{1h}/n_1$.

Using (4.7), we can show that the estimated variance of
$\hat{Y}_{EXP}$, $V(\hat{Y}_{EXP})$, consists of the sum of $V_1(\hat{Y}_{EXP})$ and
$V_2(\hat{Y}_{EXP})$, corresponding to the first and second phases of
the sampling design. Thus:


eles Fe V, ee) S Voll ce) 
where 
ee (1-f,) = Z 
Vili ee) =N?—_—* Pyl( Says, 
Nicks 
n 
‘ = Van ~Fa, sf , 
ee CLs tos) pA 
Veer) = ry = Di 
h=l Nop, 
and 
(~My) Sy ips denen 
Ny, (Ny mee nid My 
4 1 ™ 
Sis ? = an (Y, Vig) 3 
Ny, 
ad 1 
Y2n, = ae Ve 
Ny, 


and $\bar{y}_{2,st} = \sum_{h=1}^{L} p_{1h} \bar{y}_{2h}$.
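
As a small numerical illustration of the point estimator in Example 1, the sketch below computes N times the sum over strata of $p_{1h} \bar{y}_{2h}$, with $p_{1h} = n_{1h}/n_1$. The function name and the input figures are hypothetical and are not taken from the paper; the variance components are not computed here.

```python
from typing import Sequence


def two_phase_stratified_total(
    N: int, n1h: Sequence[int], ybar2h: Sequence[float]
) -> float:
    """Expansion estimator N * sum_h p_1h * ybar_2h, with p_1h = n_1h / n_1.

    N      : population size
    n1h    : first-phase sample counts falling in each stratum h
    ybar2h : second-phase sample means of y within each stratum h
    """
    if len(n1h) != len(ybar2h):
        raise ValueError("n1h and ybar2h must have one entry per stratum")
    n1 = sum(n1h)  # total first-phase sample size
    return N * sum((n / n1) * ybar for n, ybar in zip(n1h, ybar2h))


# Hypothetical inputs: two strata observed in a first-phase sample of 100 units.
estimate = two_phase_stratified_total(N=1000, n1h=[60, 40], ybar2h=[10.0, 20.0])
print(estimate)
```

The first-phase stratum proportions play the role of estimated stratum weights, which is why only the counts $n_{1h}$ and the second-phase means are needed.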


Example 2: Let us assume that, for the sampling design
described in Example 1, we also have auxiliary data $x_k$
available in the first phase $s_1$. If we assume that the slopes
($B_h$) vary among the strata, we can assume that the
following model holds: $y_k = x_k B_h + e_k$, where
$E(e_k) = 0$, $E(e_k^2) = \sigma_h^2$ and $E(e_k e_\ell) = 0$
for $k \neq \ell$, for $k, \ell \in s_{1h}$; $h = 1, \ldots, L$. This model gives us a
separate regression estimator, that is,
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where 


52h 2 
ohn By 


Sop, 
Nop, 


ye Mh ae 


The variance of $\hat{Y}_{SEP,REG}$ is estimated as being the sum
of the variance components of each phase. These components
are $V_1(\hat{Y}_{EXP})$ and $V_2(\hat{Y}_{SEP,REG})$, where $V_1(\hat{Y}_{EXP})$
was defined in Example 1. The variance $V_2(\hat{Y}_{SEP,REG})$ is
obtained by replacing variable $y_k$ by $e_k = g_k (y_k - x_k' \hat{B}_h)$
in $V_2(\hat{Y}_{EXP})$. The estimated variance of $\hat{Y}_{SEP,REG}$ is


eee N71 ti dxé 
Veep, nec)" Pi, - Tim Kise 
1 
ire wetabaes 17e 
x 7-102 — Vo st) 
= Nea) 2 §2 
rad ty 1h > 2eh 
where 
$ (é, Cas 
& = 
2eh Deen Ny =| 
and 
G2 1 2 
Soyn = a So (%, 7 Von) 


5.2 Non-nested Sampling 


These two examples are taken from Des Raj (1968, 
pages 142-149). We are using them to illustrate the results 
of sections 3 and 4. We consider two different sampling 
designs. 

With the first sampling design, we assume that: (i) the
first sample $s_1$ of size $n_1$ is selected with a simple random
sampling design without replacement from population U;
and (ii) the second sample $s_2$ of size $n_2$ is selected either by
using measurements of size $x_k$ found in the first sample $s_1$
(nested case) or by selecting it independently (non-nested
case) from the first sample $s_1$ in a manner proportional to
size $x_k$ (known for all units of the population). The resulting
estimator is


For the second sampling design, we assume that the two
samples $s_1$ and $s_2$ have been selected using a simple
random sampling design without replacement. Here again,
we examine the nested and non-nested cases. We assume
that we find the auxiliary observation $x_k$ for any unit
selected in the first sample $s_1$. The estimator is the ratio estimator
$\hat{Y} = \left(\frac{N}{n_1} \sum_{s_1} x_k\right) \left(\sum_{s_2} y_k \big/ \sum_{s_2} x_k\right)$.
Table 3 presents these two sampling designs, as well as the corresponding
estimators with their estimated variances for the nested and
non-nested cases.

The undefined terms in Table 3 are given by p,; = 


ee XP; = sain Vi Yee 1/n, Dyp,(y,/p,-Y)?; 
53 = (NPD OFF Re) er IN  FereIN 
ee Y/X. 


Table 3 shows that there is little difference in the
variances between the nested and non-nested cases. For
the estimator of the first sampling design, the variance will be
smaller for the nested case if the coefficient of variation (CV)
of variable y is smaller than that of variable x. For the estimator
of the second sampling design, the variance will be smaller for
the nested case if $\rho\,CV(y) < CV(x)$, where $\rho$ is the
correlation between y and x.


5.3 Two Statistics Canada Surveys 


Several Statistics Canada surveys use double sampling. 
We will illustrate the ideas presented in this paper using 
two business surveys. These surveys are the Quarterly 
Retail Commodity Survey (QRCS) and the Survey of 
Employment, Payrolls and Hours (SEPH). The Quarterly 
Retail Commodity Survey uses nested double sampling, 
whereas the Survey of Employment, Payrolls and Hours 
(SEPH) uses non-nested double sampling. 
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The Quarterly Retail Commodity Survey: The 
purpose of the QRCS is to obtain detailed information on
retail commodity sales on a quarterly basis. The QRCS is a
sub-sample of the Monthly Survey of Retail Trade (MRTS),
a monthly survey. The MRTS measures mainly sales by
trade group (group of three or four-digit codes of the 1980 
Standard Industrial Classification (SIC)), by province and 
for certain census metropolitan areas (CMA). The target 
population is statistical companies with statistical locations 
identified on the Business Register and which are active in 
the retail trade. About 16,000 companies are interviewed 
each month. The population is stratified by province, 
territory, certain CMA and by trade group. 

The MRTS is stratified in H strata, based on size (2-3 
groups), geography (10 provinces, 2 territories) and 
industry (16 main groups). This sample is restratified 
independently for the QRCS. The QRCS stratification 
differs from the MRTS geographically, by size and by 
industry. A sub-sample is selected using the “new” 
stratification of the MRTS sample. The QRCS estimate is 
based on a double-ratio estimator that uses auxiliary data 
(sales) from the MRTS. The second-phase sampling unit 
(QRCS) remains the statistical company. The first-phase 
sample is restratified by trade group, by province and by 
size based on the most recent information from the MRTS. 
For stratification purposes, each company is assigned a 
province and a dominant trade group based on the one that 
generates the most sales. The two-phase estimator is used 
by the MRTS. Binder, Babyak, Brodeur, Hidiroglou, and 
Jocelyn (2000) derived a variance estimator that took into 
account the sampling design and the estimation method. 
They expressed variance estimators of the total as simple 
sums of appropriate residual terms for the case of the ratio 
estimator. 

Table 3
Two Sampling Designs with Nested and Non-nested Samples
(Columns: sampling design 1, PPSWOR; sampling design 2, SRSWOR. Rows: sampling design, estimator, and estimated variances for the nested and non-nested cases.)

The results of Binder et al. (2000) can be adapted to
incorporate the optimal regression estimator in each phase.
We assume that the auxiliary information ($x_{1k}$) is known at
the level of population U, either for each unit $k \in U$ or for
the total $X_1 = \sum_U x_{1k}$. The QRCS sampling design can be
formally stated as follows. The population is stratified in H
strata $U_h$; $h = 1, \ldots, H$, and simple random samples without
replacement $s_{1h}$, of size $n_{1h}$, are selected in each stratum
$U_h$. The $x_k$ variable is observed for each unit belonging to $s_1$.
The resulting first-phase sample, $s_1 = \bigcup_{h=1}^{H} s_{1h}$, is then
stratified in strata $s_{1g}$; $g = 1, \ldots, G$. The stratification of $s_1$
is independent of the stratification of the universe U. A
simple random sample $s_{2g}$ of size $n_{2g}$ is then selected from
each stratum $s_{1g}$; $g = 1, \ldots, G$. We observe $(y_k, x_k)$, where
$x_k = (x_{1k}, x_{2k})$, for each unit belonging to sample
$s_2 = \bigcup_{g=1}^{G} s_{2g}$. We assume that models $y_k = x_{1k}' B_1 + \varepsilon_{1k}$ and
$y_k = x_k' B_2 + \varepsilon_{2k}$ hold for $s_1$ and $s_2$ respectively. For each of
these models, $\varepsilon_{1k} \sim (0, \sigma_1^2 z_{1k})$ and $\varepsilon_{2k} \sim (0, \sigma_2^2 z_{2k})$, where $z_{1k}$
and $z_{2k}$ are known positive factors. If $z_{1k} \neq 1$ or $z_{2k} \neq 1$ for
all units $k \in U$, the data can be standardized by dividing
them either by $\sqrt{z_{1k}}$ or $\sqrt{z_{2k}}$. The resulting optimal
regression estimator for the total Y is given by:

Vopet met = (be at 


1, OPT OPT 


where the components of $\hat{Y}_{OPT}$ were defined in section 3.1.
The simplified form (without double sums) of the variance
of $\hat{Y}_{OPT}$ is:
41> ee) Sh 
Vien) =) N, (1 - f)— 
h=l lh 
he 
72 nie (1 Sire 
Be 2g 


G Mn G M2n 
42 1 sy Ute 2 tama vk Nie. 
Sin = , —e,-—| ee iE | [2 
nN, gelk=1 No, n\ g=l k=1 Ny, 
Np, 
Se = 1 5 5 = 
2he 7 ae Te 
Noho k=] 


and 


Se = Deir en |e 


The means in these estimated variances are 


‘i 1 "2ne 
©) (hg) — oe 1 Sing) = 7 Donicig 


Nyn¢ k=1 
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and 


Here, $n_{2hg}$ is the number of units selected in sample $s_2$
belonging to the intersection of strata $U_h$ and $s_{1g}$. Also, the
required residuals are $e_{1k} = g_{1k}(y_k - x_{1k}' \hat{B}_{1,OPT})$ and
$e_{2k} = g_{2k}(y_k - x_k' \hat{B}_{2,OPT})$. The adjustment factors $g_{1k}$ and
$g_{2k}$ are as defined in section 4.1.


The Survey of Employment, Payrolls and Hours: The 
objective of this survey is to obtain estimates of the number 
of paid employees, the average weekly payroll and other 
related variables using various combinations of industry and 
province. This survey was recently redesigned to use admi- 
nistrative data for all businesses included in the survey 
universe. The survey produces estimates based on both the 
administrative data (ADMIN sample) and data directly 
obtained by a survey known as the Business Payroll Survey 
(BPS). 

The ADMIN sample $s_1$ consists of some 200,000 units
selected from universe $U_1$ of the pay deduction accounts to
obtain the administrative data. The sampling design for this
sample is stratified Bernoulli (by region), and the sampling
rate varies between 10% and 100% among the different
strata (regions). The size of the sample represents approximately
20% of the total number of pay deduction accounts.
Only two variables, represented as $x_k^{(1)}$, are available from
the administrative source: these are the number of paid
employees and the gross monthly payroll.

The BPS sample $s_2$ consists of approximately 10,000
establishments drawn from the Business Register $U_2$. The
BPS collects the same two variables as the administrative
source, namely, the number of paid employees and the
gross monthly payroll, denoted as $x_k^{(2)}$, several other
variables ($y_k^{(2)}$) of interest defined by type of employee
(employees paid by the hour, salaried, active owners, other
employees), and variables of interest, such as the number
of paid hours and weekly earnings, also denoted $y_k^{(2)}$. More
information on the BPS is provided in Rancourt and Hidiroglou
(1998).

The BPS is stratified by industry type, geographic region
and size (varying from two to three groups based on the
number of employees). These strata were designed to take
into account the different regression models between $y_k^{(2)}$
and $x_k^{(2)}$. The resulting estimated regression coefficients are
used to predict $y_k$ for each sampled administrative record.
There are two steps involved in the estimation of the total
for a given variable of interest. First, the sampling weights $w_k^{(1)}$
associated with the administrative data are calibrated using
known regional population counts, $N_i$, for regions
$U_{1i}$, $i = 1, \ldots, I$. The adjusted weight of a sample unit k
belonging to region $U_{1i}$ is $\tilde{w}_k^{(1)} = w_k^{(1)} g_{1i}$, where $g_{1i} =
N_i / \sum_{s_{1i}} w_k^{(1)}$ and $s_{1i} = s_1 \cap U_{1i}$. Second, $y_k^{(2)}$ is regressed on
$x_k^{(2)}$ using subsets $s_{2j}$, $j = 1, \ldots, J$, of the $s_2$ sample. The $s_{2j}$
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subsets, classified by industry, region and sometimes size, 
are formed in advance to obtain the best possible regression 
fits. For each subset $s_{2j}$, the estimated regression vectors
$\hat{B}_j$ are obtained as:

$$\hat{B}_j = \left[\sum_{s_{2j}} w_k^{(2)} x_k^{(2)} x_k^{(2)\prime} / \delta_k\right]^{-1} \sum_{s_{2j}} w_k^{(2)} x_k^{(2)} y_k^{(2)} / \delta_k,$$

where $w_k^{(2)}$ is the sampling weight for each sampled
establishment, and $\delta_k$ are known positive factors that
control the impact of outliers or define the required estimator.
For example, if $\delta_k$ is proportional to one of the components
of $x_k^{(2)}$, we obtain the ratio estimator. The estimator of
total for a variable y is therefore $\hat{Y} = \sum_{j=1}^{J} \sum_{s_{1j}} \tilde{w}_k^{(1)} x_k^{(1)\prime} \hat{B}_j$,
where $s_{1j}$ is a partition of $s_1$ corresponding to the subsets
defining $s_{2j}$. SEPH is an example of a non-nested double
sampling design. More details of the SEPH
redesign are available in Hidiroglou (1995) and Hidiroglou,
Latouche, Armstrong and Gossen (1995).
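
The first SEPH estimation step, calibrating the administrative sampling weights to known regional counts, can be sketched as follows. This is only an illustration under the assumption that the adjustment is a simple ratio (each weight in a region is multiplied by the known count divided by the sum of sampled weights in that region); the function name and the data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def calibrate_to_region_counts(
    weights: Sequence[float],
    regions: Sequence[Hashable],
    counts: Dict[Hashable, float],
) -> List[float]:
    """Scale weights so that, within each region, they sum to the known count.

    Assumed adjustment: w_k * N_i / (sum of w over sampled units in region i).
    """
    region_totals: Dict[Hashable, float] = {}
    for w, r in zip(weights, regions):
        region_totals[r] = region_totals.get(r, 0.0) + w
    return [w * counts[r] / region_totals[r] for w, r in zip(weights, regions)]


# Hypothetical data: three sampled accounts in two regions.
adjusted = calibrate_to_region_counts(
    weights=[2.0, 2.0, 4.0],
    regions=["east", "east", "west"],
    counts={"east": 10.0, "west": 8.0},
)
print(adjusted)
```

After this adjustment the weights in each region reproduce the known regional count, which is the defining property of this kind of calibration.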


6. CONCLUSION 


Nested and non-nested double sampling are usually 
treated separately in the literature. Given that the population 
total Y is of interest, and that there is auxiliary information 
available, this paper has unified the estimation procedures 
for these two sampling methods using an optimal regression 
approach. Also, for the nested case, the procedure has been 
linked to the GREG procedure proposed by Hidiroglou and
Särndal (1998). For the non-nested case, the method used
by Deville (1999) has been extended when there are also 
auxiliary data at the population level. Lastly, practical 
examples were provided to illustrate this theory. 
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Estimation Using the Generalised Weight Share Method: 
The Case of Record Linkage 


PIERRE LAVALLÉE and PIERRE CARON¹


ABSTRACT 


More and more, databases are combined using record linkage methods to increase the amount of available information. 
When there is no unique identifier to perform the matching, a probabilistic linkage is used. A record on the first file is linked 
to a record on the second file with a certain probability, and then a decision is made on whether this link is a true link or 
not. This process usually requires a certain amount of manual resolution that is costly in terms of time and employees. Also, 
this process often leads to a complex linkage. That is, the linkage between the two databases is not necessarily one-to-one, 
but can rather be many-to-one, one-to-many, or many-to-many. 


Two databases combined using record linkage can be seen as two populations linked together. We consider in this paper 
the problem of producing estimates for one of the populations (the target population) using a sample selected from the other 
one. We assume that the two populations have been linked together using probabilistic record linkage. To solve the 
estimation problem issued from a complex linkage between the population where the sample is selected and the target 
population, Lavallée (1995) suggested the use of the Generalised Weight Share Method (GWSM). This method is an 
extension of the Weight Share Method presented by Ernst (1989) in the context of longitudinal household surveys. 


The paper will first provide a brief overview of record linkage. Secondly, the GWSM will be described. Thirdly, the GWSM 
will be adapted to provide three different approaches that take into account linkage weights issued from record linkage. 
These approaches will be: (1) use all non-zero links with their respective linkage weights; (2) use all non-zero links above 
a given threshold; and (3) choose the links randomly using Bernoulli trials. For each of the approaches, an unbiased 
estimator of a total will be presented together with a variance formula. Finally, some simulation results that compare the 
three proposed approaches to the Classical Approach (where the GWSM is used based on links established through a
decision rule) will be presented.


KEY WORDS: Generalised weight share method; Record linkage; Estimation; Clusters. 


1. INTRODUCTION 


To augment the amount of available information, data 
from different sources are increasingly being combined. 
These databases are often combined using record linkage 
methods. When the files involved have a unique identifier 
that can be used, the linkage is done directly using the iden- 
tifier as a matching key. When there is no unique identifier, 
a probabilistic linkage is used. In that case, a record on the 
first file is linked to a record on the second file with a 
certain probability, and then a decision is made on whether 
this link is a true link or not. Note that this process usually 
requires a certain amount of manual resolution that is costly 
in terms of time and employees. 

We consider the production of an estimate of a total (or 
a mean) of one target clustered population when using a 
sample selected from another population linked to the first 
population. We assume that the two populations have been 
linked together using probabilistic record linkage. Note that 
this type of linkage often leads to a complex linkage 
between the two populations. That is, the linkage between 
the units of each of the two populations is not necessarily 
one-to-one, but can rather be many-to-one, one-to-many, or 
many-to-many. 


caropie @statcan.ca. 


To solve the estimation problem caused by a complex 
linkage between the population where the sample is 
selected and the target population, Lavallée (1995) 
suggested the use of the Generalised Weight Share Method 
(GWSM). This method is an extension of the Weight Share 
Method presented by Ernst (1989). Although this last 
method has been developed in the context of longitudinal 
household surveys, it was shown that the Weight Share 
Method can be generalised to situations where a target 
population of clusters is sampled through the use of a frame 
which refers to a different population, but somehow linked 
to the first one. 

The problem that is considered in this paper is to 
estimate the total of a characteristic of a target population 
that is naturally divided into clusters. Assuming that the 
sample is obtained by the selection of units within clusters, 
if at least one unit of a cluster is selected, then the whole 
cluster is interviewed. This usually leads to cost reductions 
as well as the possibility of producing estimates on the 
characteristics of both the clusters and the units. 

In the present paper, we will try to answer the following 
questions: 

a) Can we use the GWSM to handle the estimation 
problem related to populations linked together 
through record linkage? 
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b) Can we adapt the GWSM to take into account the 
linkage weights issued from record linkage? 


c) Can GWSM help in reducing the manual 
resolution required by record linkage? 


d) If there is more than one approach to use the 
GWSM, is there a “better” approach? 


It will be seen that the answer is clearly yes to (a) and 
(b). However, for question (c), it will be shown that there is 
a price to pay in terms of an increase to the sample size, and 
therefore to the collection costs. For question (d), although 
there is no definite answer, some approaches seem to 
generally be more appropriate. 

The paper will first provide a brief overview of record 
linkage. Secondly, the GWSM will be described. Thirdly, 
the GWSM will be adapted to provide three different 
approaches that take into account linkage weights issued 
from record linkage. These approaches will be: (1) use all 
non-zero links with their respective linkage weights; (2) use 
all non-zero links above a given threshold; and (3) choose 
the links randomly using Bernoulli trials. For each of the 
approaches, an unbiased estimator of a total will be 
presented together with a variance formula. Finally, some 
simulation results that compare the three proposed 
approaches to the Classical Approach (where the GWSM is 
used based on links established through a decision rule) will 
be presented. 


2. RECORD LINKAGE 


The concepts of record linkage were introduced by
Newcombe, Kennedy, Axford and James (1959) and formalised
in the mathematical model of Fellegi and Sunter
(1969). As described by Bartlett, Krewski, Wang and
Zielinski (1993), record linkage is the process of bringing 
together two or more separately recorded pieces of infor- 
mation pertaining to the same unit (individual or business). 
Record linkage is sometimes also called exact matching, in 
contrast to statistical matching. This last process attempts to 
link files that have few units in common (see Budd and 
Radner 1969, Budd 1971, Okner 1972, and Singh, Mantel, 
Kinack and Rowe 1993). With statistical matching, linkages 
are based on similar characteristics rather than unique 
identifying information. In the present paper, we will 
restrict ourselves to the context of record linkage. However, 
the developed theory could also be used for statistical 
matching. 

Suppose that we have two files A and B containing
characteristics relating to two populations $U^A$ and $U^B$,
respectively. The two populations are somehow related to
each other. They can represent, for example, exactly the 
same population, where each of the files contains a different 
set of characteristics of the units of that population. They 
can also represent different populations, but with some 
natural links between them. For example, one population 


can be one of parents, and the other population one of 
children belonging to the parents. Note that the children 
usually live in households that can be viewed as clusters. 
Another example is one of an agricultural survey where the 
first population is a list of farms as determined by the 
Canadian Census of Agriculture and the second population 
is a list of taxation records from the Canadian Customs and 
Revenue Agency (CCRA). In the first population, each 
farm is identified by a unique identifier called the FarmID 
and some additional variables such as the name and address 
of the operators that are collected through the Census 
questionnaire. The second population consists of taxation 
records of individuals who have declared some form of 
agricultural income. These individuals live in households. 
The unique identifier on those records is either a social 
insurance number or a corporation number depending on 
whether or not the business is incorporated. However, each
income tax report submitted to CCRA contains similar
variables (name and address of respondent, etc.) as those
collected by the Census.

The purpose of record linkage is to link the records of
the two files A and B. If the records contain unique identifiers,
then the matching process is trivial. For example, in
the agriculture example, if both files contained the
FarmID, the matching process could be done using a simple
matching procedure. Unfortunately, a unique identifier
is often not available, and the linkage process then needs to
use some probabilistic approach to decide whether two
records of the two files are linked together or not. With this 
linkage process, the likelihood of a correct match is 
computed and, based on the magnitude of this likelihood, it 
is decided whether we have a link or not. 

Formally, we consider the product space $A \times B$ from the
two files A and B. Let j indicate a record (or unit) from file
A (or population $U^A$) and k a record (or unit) from file B
(or population $U^B$). For each pair (j, k) of $A \times B$, we
compute a linkage weight reflecting the degree to which the
pair (j, k) is likely to be a true link. The higher the linkage
weight is, the more likely the pair (j, k) is a true link. The
linkage weight is commonly based on the ratios of the
conditional probabilities of having a match $\mu_{jk}$ and an
unmatch $\bar{\mu}_{jk}$, given the outcome of the comparison
$C_{qjk}$ of the characteristic q of the records j from A and
k from B, $q = 1, \ldots, Q$. That is,

$$\theta_{jk} = \log_2 \left\{ \frac{P(\mu_{jk} \mid C_{1jk} C_{2jk} \cdots C_{Qjk})}{P(\bar{\mu}_{jk} \mid C_{1jk} C_{2jk} \cdots C_{Qjk})} \right\} = \sum_{q=1}^{Q} \theta_{qjk} + \log_2 \left\{ \frac{P(\mu_{jk})}{P(\bar{\mu}_{jk})} \right\} \qquad (2.1)$$

where $\theta_{qjk} = \log_2 \{ P(C_{qjk} \mid \mu_{jk}) / P(C_{qjk} \mid \bar{\mu}_{jk}) \}$ for $q = 1, \ldots, Q$.
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The mathematical model proposed by Fellegi and Sunter 
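(1969) takes into account the probabilities of an error in the
linkage of units j from A and k from B. The linkage weight
is then defined as

$$\theta_{jk} = \sum_{q=1}^{Q} \theta_{qjk}$$

where

$$\theta_{qjk} = \begin{cases} \log_2 (\eta_{qjk} / \bar{\eta}_{qjk}) & \text{if characteristic } q \text{ of pair } (j,k) \text{ agrees} \\ \log_2 \left( (1 - \eta_{qjk}) / (1 - \bar{\eta}_{qjk}) \right) & \text{otherwise} \end{cases}$$

with $\eta_{qjk} = P(\text{characteristic } q \text{ agrees} \mid \mu_{jk})$ and
$\bar{\eta}_{qjk} = P(\text{characteristic } q \text{ agrees} \mid \bar{\mu}_{jk})$. Note that the definition
of $\theta_{jk}$ assumes that the Q comparisons are independent.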
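
To make the agreement and disagreement weights concrete, the following sketch sums base-2 log ratios over Q independent comparisons, in the Fellegi and Sunter style described above. The function name, argument names and probabilities are hypothetical; the probabilities stand for the chance that a characteristic agrees given a match and given an unmatch, and are assumed to lie strictly between 0 and 1.

```python
import math
from typing import Sequence


def fellegi_sunter_weight(
    agrees: Sequence[bool],
    p_agree_match: Sequence[float],
    p_agree_unmatch: Sequence[float],
) -> float:
    """Sum of per-characteristic log2 weights for one pair (j, k).

    Agreement contributes log2(m / u); disagreement contributes
    log2((1 - m) / (1 - u)), where m and u are the agreement probabilities
    given a match and given an unmatch. Comparisons are assumed independent.
    """
    total = 0.0
    for agree, m, u in zip(agrees, p_agree_match, p_agree_unmatch):
        if agree:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


# Hypothetical pair: the first characteristic agrees, the second does not.
weight = fellegi_sunter_weight([True, False], [0.9, 0.8], [0.1, 0.2])
print(round(weight, 4))
```

Agreement on a characteristic that rarely agrees by chance raises the weight sharply, while disagreement on a characteristic that usually agrees for true matches lowers it.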

The linkage weights given by (2.1) are defined on $\mathbb{R}$, the
set of real numbers, i.e., $\theta_{jk} \in \,]-\infty, +\infty[$. When the ratio of
the conditional probabilities of having a match $\mu_{jk}$ and an
unmatch $\bar{\mu}_{jk}$ is equal to 1, we get $\theta_{jk} = 0$. When this ratio is
close to 0, $\theta_{jk}$ tends to $-\infty$. It might then be more convenient
to define the linkage weights on $[0, +\infty[$. This can be
achieved by taking the antilogarithm of $\theta_{jk}$. We then obtain
the following linkage weight $\theta_{jk}$:

$$\theta_{jk} = \frac{P(\mu_{jk} \mid C_{1jk} C_{2jk} \cdots C_{Qjk})}{P(\bar{\mu}_{jk} \mid C_{1jk} C_{2jk} \cdots C_{Qjk})} \qquad (2.2)$$

Note that the linkage weight $\theta_{jk}$ is equal to 0 when the
conditional probability of having a match $\mu_{jk}$ is equal to 0.
In other words, we have $\theta_{jk} = 0$ when the probability of
having a true link for (j, k) is null.

Once a linkage weight $\theta_{jk}$ has been computed for each
pair (j, k) of $A \times B$, we need to decide whether the linkage
weight is sufficiently large to consider the pair (j, k) a link.
This is typically done using a decision rule. With the
approach of Fellegi and Sunter, we use an upper threshold
$\theta_{High}$ and a lower threshold $\theta_{Low}$ to which each linkage
weight $\theta_{jk}$ is compared. The decision is made as follows:

$$D(j,k) = \begin{cases} \text{link} & \text{if } \theta_{jk} \geq \theta_{High} \\ \text{can be a link} & \text{if } \theta_{Low} < \theta_{jk} < \theta_{High} \\ \text{nonlink} & \text{if } \theta_{jk} \leq \theta_{Low} \end{cases} \qquad (2.3)$$


The lower and upper thresholds $\theta_{Low}$ and $\theta_{High}$ are
determined by a priori error bounds based on false links
and false nonlinks. When applying decision rule (2.3), some
clerical decisions are needed for those linkage weights
falling between the lower and upper thresholds. This is
generally done by looking at the data, and also by using
auxiliary information. In the agriculture example, variables
such as date of birth, street address and postal code, which
are available on both sources of data, can be used for this
purpose. By being automated and also by working on a
probabilistic basis, some errors can be introduced in the
record linkage process. This has been discussed in several
papers, namely Bartlett et al. (1993), Belin (1993) and
Winkler (1995).
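
Decision rule (2.3) amounts to a three-way classification of each pair by its linkage weight. The sketch below is a minimal illustration; the function name, the labels and the threshold values are hypothetical, and in practice the thresholds would come from the a priori error bounds mentioned above.

```python
def classify_pair(theta: float, theta_low: float, theta_high: float) -> str:
    """Apply the two-threshold rule: link, possible link, or nonlink."""
    if theta_low > theta_high:
        raise ValueError("theta_low must not exceed theta_high")
    if theta >= theta_high:
        return "link"
    if theta <= theta_low:
        return "nonlink"
    # Weights strictly between the thresholds go to clerical review.
    return "possible link"


# Hypothetical thresholds and linkage weights.
for theta in (12.0, 5.0, 1.0):
    print(theta, classify_pair(theta, theta_low=2.0, theta_high=10.0))
```

Only the pairs labelled "possible link" require manual resolution, so the gap between the two thresholds directly drives the clerical workload.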

The application of decision rule (2.3) leads to the
definition of an indicator variable $l_{jk} = 1$ if the pair (j, k) is
considered to be a link, and 0 otherwise. As for the
decisions that need to be taken for those linkage weights
falling between the lower and upper thresholds, some
manual intervention may be needed to decide on the validity
of the links. In the case where the files A and B represent
the same population (with a different set of characteristics),
it is likely that for each unit j from file A, there will be only
one unit linked in file B. That is, the units should be linked
on a one-to-one basis. Note that decision rule (2.3) does not
prevent the existence of many-to-one, one-to-many, or
many-to-many links. As mentioned before, because of the
probabilistic aspect of the record linkage process, which
might introduce some errors, there could be more than one
link per unit. In practice, this problem is usually solved by
some manual intervention. In the agriculture example, it can
occur that multiple operators of a farm each submit a tax
report to CCRA for the same farm (one-to-many). Similarly,
an operator who runs more than one farm could submit
only one income tax report for his operations (many-to-one).
Finally, one can imagine a scenario of many-to-many
links when an operator runs more than one farm, where
each farm has a number of different operators. These
situations can be represented by Figure 1. In Figure 1, unit
j=1 of $U^A$ has a one-to-one link to unit k=1 of $U^B$; unit j=2
forms a one-to-many link to units k=2 and k=4; and units
j=2 and j=3 together form a many-to-one link to unit k=4.
For the agriculture example, it is clear that deciding on the
validity of the links is more difficult than the case of the
same population since the former allows the possibility of
having true many-to-one or one-to-many situations.


Figure 1. Example of links 
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3. THE GENERALISED WEIGHT SHARE 
METHOD 


The GWSM is described in Lavallée (1995). It is an 
extension of the Weight Share Method described by Ernst 
(1989) but in the context of longitudinal household surveys. 
Various implications of using the Weight Share Method for 
longitudinal household surveys have been described by 
Gailly and Lavallée (1993). The GWSM can be viewed as 
a generalisation of Network Sampling and also of Adaptive 
Cluster Sampling. These two sampling methods are 
described in Thompson (1992), and Thompson and Seber 
(1996). 

Suppose that a sample $s^A$ of $m^A$ units is selected from
the population $U^A$ of $M^A$ units using some sampling
design. Let $\pi_j^A$ be the selection probability of unit j. We
assume $\pi_j^A > 0$ for all $j \in U^A$.

Let the population $U^B$ contain $M^B$ units. This population
is divided into N clusters where cluster i contains
$M_i^B$ units. For example, in the context of social surveys, the
clusters can be households and the units can be the persons 
within the households. For business surveys, the clusters 
can be enterprises and the units can be the establishments 
within the enterprises. For the agriculture example, the 
clusters can be households, and the units, persons within the 
household who file an income tax report to CCRA. 

We suppose that there exists a link between the units j of
population $U^A$ and the units k of clusters i of the population
$U^B$. This link is identified by an indicator variable $l_{j,ik}$,
where $l_{j,ik} = 1$ if there exists a link between unit $j \in U^A$ and
unit $ik \in U^B$, and 0 otherwise. Note that there might be
some units j of population $U^A$ for which there is no link
with any unit k of a cluster i of population $U^B$, i.e.,
$L_j^A = \sum_{i=1}^{N} \sum_{k=1}^{M_i^B} l_{j,ik} = 0$ for some $j \in U^A$. Also, there can be
zero, one or more links for any unit k of a cluster i of population
$U^B$, i.e., $L_{ik} = \sum_{j=1}^{M^A} l_{j,ik} = 0$, $L_{ik} = 1$ or $L_{ik} > 1$ for any
$k \in U^B$.

With the GWSM, we have the following constraint:

Each cluster i of $U^B$ must have at least one link
with a unit j of $U^A$, i.e., $L_i = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} l_{j,ik} > 0$.


This constraint is essential for the GWSM to produce 
unbiased estimates. We will see in section 4 that in the 
context of record linkage, this constraint might not be 
satisfied. 

For each unit j selected in $s^A$, we identify the units ik of $U^B$
that have a non-zero link with j, i.e., $l_{j,ik} = 1$. For each
identified unit ik, we suppose that we can establish the list
of the $M_i^B$ units of cluster i containing this unit. Then, each
cluster i represents by itself a population $U_i^B$, where
$U^B = \bigcup_{i=1}^{N} U_i^B$. Let $\Omega^B$ be the set of the n clusters identified
by the units $j \in s^A$.

From population $U^B$, we are interested in estimating the
total $Y^B = \sum_{i=1}^{N} \sum_{k=1}^{M_i^B} y_{ik}$ for some characteristic y. An
important constraint that is imposed in the measurement (or
interviewing) process of y is to consider all units within the

same cluster. That is, if a unit is selected in the sample, then 
every unit of the cluster containing the selected unit is inter- 
viewed. This constraint is one that often arises in surveys 
for two reasons: cost reductions and the need for producing 
estimates on clusters. As an example, for social surveys, 
there is normally a small marginal cost for interviewing all 
persons within the household. On the other hand, household 
estimates are often of interest with respect to poverty 
measures, for example. For the agriculture example, one 
value of interest is the total farm revenue per household. In 
that case, we need to interview all persons within the house- 
hold. 

By using the GWSM, we want to assign an estimation
weight $w_{ik}$ to each unit $k$ of an interviewed cluster $i$. To
estimate the total $Y^B$ belonging to population $U^B$, one can
then use the estimator

$$\hat{Y}^B = \sum_{i=1}^{n} \sum_{k=1}^{M_i^B} w_{ik}\, y_{ik} \qquad (3.1)$$

where $n$ is the number of interviewed clusters and $w_{ik}$ is the
weight attached to unit $k$ of cluster $i$. With the GWSM, the
estimation process uses the sample $s^A$ together with the
links existing between $U^A$ and $U^B$ to estimate the total $Y^B$.
The links are in fact used as a bridge to go from population $U^A$
to population $U^B$, and vice versa.

The GWSM allocates to each interviewed unit $ik$ a final
weight established from an average of weights calculated
within each cluster $i$ entering into $\hat{Y}^B$. An initial weight that
corresponds to the inverse of the selection probability is
first obtained for all units $k$ of cluster $i$ of $\hat{Y}^B$ having a
non-zero link with a unit $j \in s^A$. An initial weight of zero is
assigned to units not having a link. The final weight is
obtained by calculating the ratio of the sum of the initial
weights for the cluster over the total number of links for
that cluster. This final weight is then assigned to all units
within the cluster. Note that allocating the same
estimation weight to all units has the considerable
advantage of ensuring consistency of estimates for units and
clusters.

Formally, each unit $k$ of cluster $i$ entering into $\hat{Y}^B$ is
assigned an initial weight $w'_{ik}$ as follows:

$$w'_{ik} = \sum_{j=1}^{M^A} l_{j,ik}\, \frac{t_j}{\pi_j^A} \qquad (3.2)$$

where $t_j = 1$ if $j \in s^A$ and 0 otherwise. Note that a unit $ik$
having no link with any unit $j$ of $U^A$ automatically has an
initial weight of zero. The final weight $w_i$ is given by

$$w_i = \frac{\sum_{k=1}^{M_i^B} w'_{ik}}{\sum_{k=1}^{M_i^B} L^B_{ik}} \qquad (3.3)$$

where $L^B_{ik} = \sum_{j=1}^{M^A} l_{j,ik}$. The quantity $L^B_{ik}$ represents the
number of links between the units of $U^A$ and the unit $k$ of
cluster $i$ of $U^B$. The quantity $L_i^B = \sum_{k=1}^{M_i^B} L^B_{ik}$ then corresponds
to the total number of links present in cluster $i$. Finally, we
assign $w_{ik} = w_i$ for all $k \in U_i^B$ and use equation (3.1) to
estimate the total $Y^B$.
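
The weighting steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `gwsm_weights`, the tiny link matrix, the selection probabilities of 0.5 and the values of y are all hypothetical and do not come from the paper.

```python
def gwsm_weights(links, clusters, sample, pi):
    """Final GWSM weights for the units of U^B (illustrative sketch).

    links[j][u] : 1 if unit j of U^A is linked to unit u of U^B, else 0
    clusters[i] : list of the U^B unit indices u belonging to cluster i
    sample      : set of the units j of U^A selected in the sample
    pi[j]       : selection probability of unit j
    """
    weights = {}
    for units in clusters:
        # Initial weights: sum of 1/pi_j over the sampled units j linked to u.
        initial = [sum(links[j][u] / pi[j] for j in sample) for u in units]
        # Total number of links of the cluster with all of U^A (sampled or not).
        total_links = sum(row[u] for row in links for u in units)
        if total_links == 0:
            raise ValueError("cluster without any link: weight undefined")
        final = sum(initial) / total_links
        for u in units:
            weights[u] = final  # the same weight for every unit of the cluster
    return weights


# Hypothetical toy data: 3 units in U^A, 4 units in U^B grouped in 2 clusters.
links = [[1, 0, 0, 0],   # unit 0 of U^A is linked to unit 0 of U^B
         [1, 0, 1, 0],   # unit 1 is linked to units 0 and 2
         [0, 0, 0, 1]]   # unit 2 is linked to unit 3
clusters = [[0, 1], [2, 3]]
pi = [0.5, 0.5, 0.5]
y = [10, 20, 30, 40]

w = gwsm_weights(links, clusters, sample={0, 2}, pi=pi)
# Clusters that are not identified by the sample get a weight of zero, so
# summing over all units is the same as summing over the identified clusters.
estimate = sum(w[u] * y[u] for u in w)
print(w, estimate)
```

Note that the denominator counts the links with every unit of U^A, whether sampled or not, which is why the full link structure of each interviewed cluster has to be known.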

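Using this last expression, it was shown in Lavallée
(1995) that the GWSM is design unbiased. Further, let
$z_{ik} = Y_i / L_i^B$ for all $k \in U_i^B$, where $Y_i = \sum_{k=1}^{M_i^B} y_{ik}$. Then, $\hat{Y}^B$ can be
expressed as

$$\hat{Y}^B = \sum_{j=1}^{M^A} \frac{t_j}{\pi_j^A} \sum_{i=1}^{N} \sum_{k=1}^{M_i^B} l_{j,ik}\, z_{ik} = \sum_{j=1}^{M^A} \frac{t_j}{\pi_j^A}\, Z_j \qquad (3.4)$$

and the variance of $\hat{Y}^B$ is given by

$$\mathrm{Var}(\hat{Y}^B) = \sum_{j=1}^{M^A} \sum_{j'=1}^{M^A} \frac{\pi^A_{jj'} - \pi^A_j \pi^A_{j'}}{\pi^A_j \pi^A_{j'}}\, Z_j Z_{j'} \qquad (3.5)$$

where $\pi^A_{jj'}$ is the joint probability of selecting units $j$ and $j'$.
See Särndal, Swensson and Wretman (1992) for the
calculation of $\pi^A_{jj'}$ under various sampling designs. The
variance $\mathrm{Var}(\hat{Y}^B)$ may be unbiasedly estimated from the
following equation:

$$\widehat{\mathrm{Var}}(\hat{Y}^B) = \sum_{j=1}^{m^A} \sum_{j'=1}^{m^A} \frac{\pi^A_{jj'} - \pi^A_j \pi^A_{j'}}{\pi^A_{jj'}\, \pi^A_j \pi^A_{j'}}\, Z_j Z_{j'} \qquad (3.6)$$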

Another unbiased estimator of the variance Var (Y) may 
be developed in the form of Yates and Grundy (1953). 
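
As a small check of the design unbiasedness and of the variance formula (3.5), the following sketch enumerates every possible sample for a toy population. The data are hypothetical, and simple random sampling without replacement is assumed for the selection of the sample from U^A so that the inclusion probabilities have a simple closed form; exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy data (not from the paper).
links = [[1, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
clusters = [[0, 1], [2, 3]]
y = [10, 20, 30, 40]
M_A, m_A = 3, 2                      # population and sample sizes for U^A
pi = Fraction(m_A, M_A)              # first-order inclusion probability
pi2 = Fraction(m_A * (m_A - 1), M_A * (M_A - 1))  # joint probability, j != j'

# Z_j = sum over the units ik linked to j of z_ik, with z_ik = Y_i / L_i^B.
Z = [Fraction(0)] * M_A
for units in clusters:
    Y_i = sum(y[u] for u in units)
    L_i = sum(row[u] for row in links for u in units)
    for j in range(M_A):
        Z[j] += Fraction(Y_i, L_i) * sum(links[j][u] for u in units)

# Estimator written in the form (3.4), evaluated on every possible sample.
estimates = [sum(Z[j] for j in s) / pi for s in combinations(range(M_A), m_A)]
mean = sum(estimates) / len(estimates)
var_enum = sum((e - mean) ** 2 for e in estimates) / len(estimates)

# Variance formula (3.5); the joint probability of the pair (j, j) is pi.
var_formula = sum(
    ((pi if j == k else pi2) - pi * pi) / (pi * pi) * Z[j] * Z[k]
    for j in range(M_A) for k in range(M_A)
)
print(mean, var_enum, var_formula)
```

In this toy case the average of the estimates over all samples equals the true total, and the variance obtained by enumeration agrees with the value given by the formula.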

In presenting the Weight Share Method in the context of
longitudinal surveys, Ernst (1989) proposed the use of
constants $\alpha$ in the definition of the estimation weights. In
the general context of the GWSM, the use of the same type
of constants can be considered. Let us define $\alpha_{j,ik} \ge 0$ for all
pairs $(j, ik)$, with $\alpha_i = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} \alpha_{j,ik} = 1$. We can then
obtain new estimation weights as follows. For each unit $k$ of
cluster $i$ entering into $\hat{Y}^B$, assign the following initial weight
$w'^{\alpha}_{ik}$:

$$w'^{\alpha}_{ik} = \sum_{j=1}^{M^A} \alpha_{j,ik}\, \frac{t_j}{\pi_j^A} \qquad (3.7)$$

The final weight $w_i^{\alpha}$ is given by

$$w_i^{\alpha} = \sum_{k=1}^{M_i^B} w'^{\alpha}_{ik} = \sum_{k=1}^{M_i^B} \sum_{j=1}^{M^A} \alpha_{j,ik}\, \frac{t_j}{\pi_j^A} \qquad (3.8)$$

Finally, we assign $w^{\alpha}_{ik} = w^{\alpha}_i$ for all $k \in U_i^B$ and use
equation (3.1) to estimate the total $Y^B$.

In the context of longitudinal surveys, Ernst (1989) noted
that the most common choice for the constants $\alpha$ is the one
where each individual receives one of two values: 0, or a
non-zero value that is equal for all the remaining units
within the cluster. In the present context, this would mean
letting $\alpha_{j,ik} = 0$ for all $j$ and $k$ in a given subset of the
pairs of cluster $i$, and $\alpha_{j,ik}$ equal to a constant for all $j$ and $k$
in the complement subset. Back to the context of longitudinal surveys, Kalton
and Brick (1995) looked at the determination of optimal
values for the $\alpha$ of Ernst (1989), where optimality is
measured in terms of minimal variance. They concluded
that: “in the two-household case, the equal household 
weighting scheme minimises the variance of the household 
weights around the inverse selection probability weight 
when the initial sample is an equal epsem (equal probabi- 
lity) one.” They also added that “in the case of an approxi- 
mately epsem sample, the equal household weighting 
scheme should be close to the optimal, at least for the case 
where the members of the household at time t come from 
one or two households at the initial wave.” This suggests
that, for the GWSM, the choice of letting the constants $\alpha$
be 0 for some units and a positive value that is equal for
all the remaining units within the cluster should be close to
optimal.


4. THE GWSM AND RECORD LINKAGE 


With record linkage, the links $l_{j,ik}$ are established
between files A and B, or population $U^A$ and population
$U^B$, using a probabilistic process. As mentioned before,
record linkage uses a decision rule D such as (2.3) to decide
whether there is a link or not between unit $j$ from file A and
unit $ik$ from file B. Once the links are established, we then
have the two populations $U^A$ and $U^B$ linked together, with
the links identified by the indicator variable $l_{j,ik}$. Note that
the decision rule (2.3) does not prevent the existence of
complex links (many-to-one, one-to-many, or many-to-many).

Although the links can be complex, the GWSM can be
used to estimate the total $Y^B$ from population $U^B$ using a
sample $s^A$ obtained from population $U^A$. Therefore, the
answer is yes to question (a) stated in the introduction. Note
that the estimates produced by the application of the
GWSM might however not be unbiased if the constraint
mentioned in section 3 is not satisfied. In that case, the use
of the estimation weight (3.3) underestimates the total $Y^B$.
To solve this problem, one practical solution is to collapse
two clusters in order to get at least one non-zero link $l_{j,ik}$
for cluster $i$. This solution usually requires some manual
intervention. Another solution is to impute a link by
choosing one link at random within the cluster, or to choose
the link with the largest linkage weight $\theta_{j,ik}$. Note that it
might also happen that for a unit $j$ of $U^A$, there is no
non-zero link $l_{j,ik}$ with any unit $ik$ of $U^B$. This is however not a
problem since the only coverage in which we are interested
is the one of $U^B$.

It is now clear that the GWSM can be used in the context
of record linkage. The GWSM with the populations $U^A$ and
$U^B$ linked together using record linkage with the decision
rule (2.3) will be referred to as the Classical Approach.

Now, with the Classical Approach, the use of the
GWSM is based on links identified by the indicator variable
$l_{j,ik}$. Is it necessary to establish whether there is positively
a link for each pair $(j, ik)$, or not? Would it be easier to
simply use the linkage weights $\theta_{j,ik}$ (without using any
decision rule) to estimate the total $Y^B$ from $U^B$ using a
sample from $U^A$? These questions lead to question (b) on
whether or not it is possible to adapt the GWSM to take into
account the linkage weights $\theta$ issued from record linkage.

In the present section, we will see that the answer to
question (b) is yes by providing three approaches where the
GWSM uses the linkage weights $\theta$. The first approach is to
use all the non-zero links identified through the record
linkage process, together with their respective linkage
weights $\theta$. The second approach is the one where we use all
the non-zero links with linkage weights above a given
threshold $\theta_{High}$. The third approach is one where the links
are randomly chosen with probabilities proportional to the
linkage weights $\theta$.


4.1 Approach 1: Using all Non-Zero Links With 
Their Respective Linkage Weights 


When using all non-zero links with the GWSM, one
might want to give more importance to links that have large
linkage weights $\theta$, compared to those that have small
linkage weights. By definition, for each pair $(j, ik)$ of
$A \times B$, the linkage weight $\theta_{j,ik}$ reflects the degree to which
the pair $(j, ik)$ is likely to be a true link. We then no longer
use the indicator variable $l_{j,ik}$ identifying whether there is
a link or not between unit $j$ from $U^A$ and unit $k$ of cluster $i$
from $U^B$. Instead, we use the linkage weight $\theta_{j,ik}$ obtained
in the first steps of the record linkage process. (This
assumes that the file with the linkage weights is available.
In practice, the only available file is often the linked file
obtained at the end of the linkage process, once some
manual resolution has been performed. In this case, the
linkage weights are no longer available and the three
proposed approaches to be used with the GWSM are
immaterial for reducing the problem of manual resolution.) Note
that by doing so, we do not need any decision to be taken to
establish whether there is a link or not between two units.

For each unit $j$ selected in $s^A$, we identify the units $ik$ of $U^B$
that have a non-zero linkage weight with unit $j$, i.e., $\theta_{j,ik} > 0$.
Let $\Omega^{RL,B}$ be the set of the $n^{RL}$ clusters identified by the
units $j \in s^A$, where “RL” stands for “Record Linkage”. Note
that because we use all non-zero linkage weights, we have
$n^{RL} \ge n$. We now obtain the initial weight $w'^{RL}_{ik}$ by directly
replacing the indicator variable $l$ in equations (3.2) and
(3.3) by the linkage weight $\theta$:

$$w'^{RL}_{ik} = \sum_{j=1}^{M^A} \theta_{j,ik}\, \frac{t_j}{\pi_j^A} \qquad (4.1)$$

The final weight $w_i^{RL}$ is given by

$$w_i^{RL} = \frac{\sum_{k=1}^{M_i^B} w'^{RL}_{ik}}{\sum_{k=1}^{M_i^B} \Theta^B_{ik}} \qquad (4.2)$$

where $\Theta^B_{ik} = \sum_{j=1}^{M^A} \theta_{j,ik}$. Finally, we assign $w^{RL}_{ik} = w^{RL}_i$ for
all $k \in U_i^B$. Note that by being present both in the numerator
and denominator of equation (4.2), the linkage weights $\theta_{j,ik}$
do not need to be between 0 and 1. They just need to
represent the relative likelihood of having a link between
two units from populations $U^A$ and $U^B$. It is also
interesting to note that by letting $\alpha_{j,ik} = \theta_{j,ik} / \Theta_i^B$, where
$\Theta_i^B = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} \theta_{j,ik}$, we obtain, for the estimation weight $w^{RL}_{ik}$,
an equivalent formulation to the one given by (3.7) and
(3.8).
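
The following sketch illustrates the Approach 1 weights and the remark that only the relative size of the linkage weights matters. The function name `rl_weights` and the toy linkage weights are hypothetical; exact fractions are used so that the two sets of weights can be compared without rounding error.

```python
from fractions import Fraction


def rl_weights(theta, clusters, sample, pi):
    """Approach 1 sketch: linkage weights replace the link indicators."""
    weights = {}
    for units in clusters:
        # Initial weights, as in (4.1).
        initial = [sum(theta[j][u] / pi[j] for j in sample) for u in units]
        # Sum of all linkage weights attached to the cluster.
        total_theta = sum(row[u] for row in theta for u in units)
        if total_theta == 0:
            raise ValueError("cluster without any non-zero linkage weight")
        final = sum(initial) / total_theta          # as in (4.2)
        for u in units:
            weights[u] = final
    return weights


# Hypothetical linkage weights: rows are units of U^A, columns units of U^B.
theta = [[60, 0, 0, 0],
         [20, 0, 40, 0],
         [0, 2, 0, 60]]
clusters = [[0, 1], [2, 3]]
pi = [Fraction(1, 2)] * 3

w = rl_weights(theta, clusters, {0, 2}, pi)

# Dividing every linkage weight by 100 leaves the estimation weights unchanged.
rescaled = [[Fraction(t, 100) for t in row] for row in theta]
w_rescaled = rl_weights(rescaled, clusters, {0, 2}, pi)
print(w, w == w_rescaled)
```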

With the Classical Approach, we stated the constraint
that each cluster $i$ of $U^B$ must have at least one link with a
unit $j$ of $U^A$, i.e., $L_i^B = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} l_{j,ik} > 0$. This constraint is
translated here into the need of having for each cluster $i$ of $U^B$
at least one non-zero linkage weight $\theta_{j,ik}$ with a unit $j$ of
$U^A$, i.e., $\Theta_i^B = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} \theta_{j,ik} > 0$. In theory, the record
linkage process does not ensure that this constraint is
satisfied. It might then turn out that for a cluster $i$ of $U^B$,
there is no non-zero linkage weight $\theta_{j,ik}$ with any unit $j$ of $U^A$.
In that case, the use of the estimation weight (4.2)
underestimates the total $Y^B$. To solve this problem, the
same solutions proposed in the context of the indicator
variables $l_{j,ik}$ can be used. That is, a solution is to collapse
two clusters in order to get at least one non-zero linkage
weight $\theta_{j,ik}$. Unfortunately, this solution might require
some manual intervention, which has been avoided up to
now by not using the decision rule (2.3). A better solution
is to impute a link by choosing one link at random within
the cluster, and then to assign an arbitrarily small value $\theta_{j,ik}$
to the chosen link (for example, the smallest calculated
non-zero linkage weight).
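
A minimal sketch of this imputation is given below. It assumes that the linkage weights are held in a small matrix (rows for the units of U^A, columns for the units of U^B); the function name `impute_missing_links` and the toy values are hypothetical.

```python
import random


def impute_missing_links(theta, clusters, rng):
    """Return a copy of theta in which every cluster has a non-zero weight.

    For a cluster with no non-zero linkage weight, one pair (j, u) is chosen
    at random and given the smallest non-zero weight found in the matrix.
    """
    theta = [list(row) for row in theta]          # work on a copy
    positive = [t for row in theta for t in row if t > 0]
    if not positive:
        raise ValueError("no non-zero linkage weight available")
    smallest = min(positive)
    for units in clusters:
        if all(theta[j][u] == 0 for j in range(len(theta)) for u in units):
            j = rng.randrange(len(theta))         # random unit of U^A
            u = rng.choice(units)                 # random unit of the cluster
            theta[j][u] = smallest
    return theta


# Hypothetical example: the second cluster (units 2 and 3) has no weight.
theta = [[60, 0, 0, 0],
         [20, 0, 0, 0]]
clusters = [[0, 1], [2, 3]]
fixed = impute_missing_links(theta, clusters, random.Random(1))
print(fixed)
```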

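To estimate the total $Y^B$ belonging to population $U^B$,
one can use the estimator

$$\hat{Y}^{RL} = \sum_{i=1}^{n^{RL}} \sum_{k=1}^{M_i^B} w^{RL}_{ik}\, y_{ik} \qquad (4.3)$$

Following the same steps used to obtain equation (3.4),
one can write $\hat{Y}^{RL}$ as

$$\hat{Y}^{RL} = \sum_{j=1}^{M^A} \frac{t_j}{\pi_j^A} \sum_{i=1}^{N} \sum_{k=1}^{M_i^B} \theta_{j,ik}\, z^{RL}_{ik} = \sum_{j=1}^{M^A} \frac{t_j}{\pi_j^A}\, Z^{RL}_j \qquad (4.4)$$

where $z^{RL}_{ik} = Y_i / \Theta_i^B$ for all $k \in U_i^B$, and $\Theta_i^B = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} \theta_{j,ik}$.
Using this last expression, it can be shown that $\hat{Y}^{RL}$
is design unbiased for $Y^B$. The variance of $\hat{Y}^{RL}$ is given by

$$\mathrm{Var}(\hat{Y}^{RL}) = \sum_{j=1}^{M^A} \sum_{j'=1}^{M^A} \frac{\pi^A_{jj'} - \pi^A_j \pi^A_{j'}}{\pi^A_j \pi^A_{j'}}\, Z^{RL}_j Z^{RL}_{j'} \qquad (4.5)$$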


4.2 Approach 2: Use all Non-Zero Links Above a 
given Threshold 


Using all non-zero links with the GWSM as in Approach
1 might require the manipulation of large files of size
$M^A \times M^B$. This is because it might turn out that most of the
records between files A and B have non-zero linkage
weights $\theta$. In practice, even if this happens, we can expect
that most of these linkage weights will be relatively small
or negligible to the extent that, although non-zero, the links
are very unlikely to be true links. In that case, it might be
useful to only consider the links with a linkage weight $\theta$
above a given threshold $\theta_{High}$.

For this second approach, we again no longer use the
indicator variable $l_{j,ik}$ identifying whether there is a link or
not, but instead, we use the linkage weights $\theta_{j,ik}$ that are
above the threshold $\theta_{High}$. The linkage weights below the
threshold are considered as zeros. We therefore define the
linkage weight:

$$\theta^{T}_{j,ik} = \begin{cases} \theta_{j,ik} & \text{if } \theta_{j,ik} \ge \theta_{High} \\ 0 & \text{otherwise.} \end{cases}$$

For each unit $j$ selected in $s^A$, we identify the units $ik$ of $U^B$
that have $\theta^{T}_{j,ik} > 0$. Let $\Omega^{RLT,B}$ be the set of the $n^{RLT}$
clusters identified by the units $j \in s^A$, where “RLT” stands
for “Record Linkage with Threshold”. Note that
$n^{RLT} \le n^{RL}$. On the other hand, we have $n^{RLT} = n$ if the
record linkage between $U^A$ and $U^B$ is done by using the
decision rule (2.3) with $\theta_{High} = \theta_{Low}$.
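
The thresholding step and its effect on the number of identified clusters can be sketched as follows. The function names, the toy linkage weights and the threshold of 30 used in the example are illustrative assumptions only.

```python
def apply_threshold(theta, theta_high):
    """Keep the linkage weights at or above the threshold, zero the others."""
    return [[t if t >= theta_high else 0 for t in row] for row in theta]


def identified_clusters(theta, clusters, sample):
    """Indices of the clusters reached from the sample by a non-zero weight."""
    return {
        i for i, units in enumerate(clusters)
        if any(theta[j][u] > 0 for j in sample for u in units)
    }


# Hypothetical linkage weights: rows are units of U^A, columns units of U^B.
theta = [[60, 0, 0, 0],
         [20, 0, 40, 0],
         [0, 2, 0, 60]]
clusters = [[0, 1], [2, 3]]

theta_t = apply_threshold(theta, theta_high=30)
n_rl = len(identified_clusters(theta, clusters, sample={2}))
n_rlt = len(identified_clusters(theta_t, clusters, sample={2}))
print(theta_t, n_rl, n_rlt)
```

In this toy case the weak link of weight 2 is dropped, so the sampled unit identifies one cluster instead of two once the threshold is applied.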


The initial weight $w'^{RLT}_{ik}$ is given by

$$w'^{RLT}_{ik} = \sum_{j=1}^{M^A} \theta^{T}_{j,ik}\, \frac{t_j}{\pi_j^A} \qquad (4.6)$$

The final weight $w_i^{RLT}$ is given by

$$w_i^{RLT} = \frac{\sum_{k=1}^{M_i^B} w'^{RLT}_{ik}}{\sum_{k=1}^{M_i^B} \Theta^{T,B}_{ik}} \qquad (4.7)$$

where $\Theta^{T,B}_{ik} = \sum_{j=1}^{M^A} \theta^{T}_{j,ik}$. Finally, we assign $w^{RLT}_{ik} = w^{RLT}_i$ for
all $k \in U_i^B$. As for Approach 1, it is interesting to note that
by letting $\alpha_{j,ik} = \theta^{T}_{j,ik} / \Theta^{T,B}_i$, where $\Theta^{T,B}_i = \sum_{j=1}^{M^A} \sum_{k=1}^{M_i^B} \theta^{T}_{j,ik}$, we
obtain, for the estimation weight $w^{RLT}_{ik}$, an equivalent
formulation to the one given by (3.7) and (3.8).

The number of zero linkage weights $\theta^{T}$ will be greater
than or equal to the number of zero linkage weights $\theta$ used
by Approach 1. Therefore, the constraint that each cluster
$i$ of $U^B$ must have at least one non-zero linkage weight $\theta^{T}_{j,ik}$
with a unit $j$ of $U^A$ might be more difficult to satisfy. In that
case, the use of the estimation weight (4.7) underestimates
the total $Y^B$. To solve this problem, the same solutions
proposed before can be used.

To estimate the total $Y^B$, one can use the same estimator
as (4.3), where we replace the number of identified clusters
$n^{RL}$ by $n^{RLT}$ and the estimation weight $w^{RL}_{ik}$ by $w^{RLT}_{ik}$. As
for estimator (4.3), it can be shown that this estimator $\hat{Y}^{RLT}$
is design unbiased.


4.3 Approach 3: Choose the Links by Random 
Selection 


In order to avoid making a decision on whether there is
a link or not between unit $j$ from $U^A$ and unit $k$ of cluster $i$
from $U^B$, one can decide to simply choose the links at
random from the set of non-zero links. For this, it is
reasonable to choose the links with probabilities proportional to
the linkage weights $\theta$. This can be achieved by Bernoulli
trials where, for each pair $(j, ik)$, we decide on accepting a
link or not by generating a random number $u_{j,ik} \sim U(0,1)$
that is compared to a quantity proportional to the linkage
weight $\theta_{j,ik}$.

From the point of view of record linkage, this approach
cannot be considered as optimal. When using the decision
rule (2.3) of Fellegi and Sunter, the idea is to try to
minimise the number of false links and false nonlinks. The
link $l_{j,ik}$ is accepted only if the linkage weight $\theta_{j,ik}$ is large
enough (i.e., $\theta_{j,ik} \ge \theta_{High}$), or if it is moderately large (i.e.,
$\theta_{Low} < \theta_{j,ik} < \theta_{High}$) and has been accepted after manual
resolution. Selecting the links randomly using Bernoulli
trials might lead to the selection of links that would
not have been accepted through the decision rule (2.3), even
though the selection probabilities are proportional to the
linkage weights. Some of the resulting links between the
two populations $U^A$ and $U^B$ might then be false ones, and
some units that are not linked might be false nonlinks. The
linkage errors are therefore likely to be higher than if the
decision rule (2.3) were used. However, in the present
context, the quality of the linkage is of secondary interest.
The present problem is to try to estimate the total $Y^B$ using
the sample $s^A$ selected from $U^A$, and not to evaluate the
quality of the links. The precision of the estimates of $Y^B$
will in fact be measured only in terms of the sampling
variability of the estimators, by conditioning on the linkage
weights $\theta_{j,ik}$. Note that this sampling variability will take
into account the random selection of the links, but not the
linkage errors.

The first step before performing the Bernoulli trials is to
transform the linkage weights in order to restrict them to the
[0,1] interval. By looking at (2.1), it can be seen that the
linkage weights $\theta_{j,ik}$ correspond in fact to a logit
transformation (in base 2) of the probability
$P(\mu_{j,ik} \mid C_{1\,j,ik}\, C_{2\,j,ik} \cdots C_{q\,j,ik})$. Similarly, the linkage weights
given by (2.2) depend only on this probability. Hence, one
way to transform the linkage weights is simply to use the
probability $P(\mu_{j,ik} \mid C_{1\,j,ik}\, C_{2\,j,ik} \cdots C_{q\,j,ik})$. From (2.1), we obtain
this result by using the function $\tilde{\theta} = 2^{\theta}/(1 + 2^{\theta})$. From (2.2),
we use $\tilde{\theta} = \theta/(1 + \theta)$. When the linkage weights are not
obtained through (2.1) nor (2.2), a possible transformation
is to divide each linkage weight by the maximum possible
value $\theta_{Max} = \max_{j=1,\ldots,M^A;\; i=1,\ldots,N;\; k=1,\ldots,M_i^B} \theta_{j,ik}$. Note that we assume that
the linkage weights are all greater than or equal to zero,
which is the case with definition (2.2), but not necessarily
in general.
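
The three transformations just described can be sketched as small Python functions. The function names are hypothetical, and the functions only implement the formulas as stated in the text; they do not reproduce definitions (2.1) and (2.2) themselves.

```python
def from_log2_weight(theta):
    """Adjusted weight 2^theta / (1 + 2^theta), for weights of the form (2.1)."""
    return 2 ** theta / (1 + 2 ** theta)


def from_ratio_weight(theta):
    """Adjusted weight theta / (1 + theta), for weights of the form (2.2)."""
    return theta / (1 + theta)


def scale_by_maximum(thetas):
    """Fallback: divide non-negative weights by the largest one."""
    largest = max(thetas)
    return [t / largest for t in thetas]


print(from_log2_weight(0), from_ratio_weight(3), scale_by_maximum([2, 15, 60]))
```

All three functions return values in the [0,1] interval for the inputs they are meant for, so the results can be used directly as acceptance probabilities in the Bernoulli trials.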

Once the adjusted linkage weights $\tilde{\theta}_{j,ik}$ have been
obtained, for each pair $(j, ik)$, we generate a random
number $u_{j,ik} \sim U(0,1)$. Then, we set the indicator variable
$\tilde{l}_{j,ik}$ to 1 if $u_{j,ik} \le \tilde{\theta}_{j,ik}$, and 0 otherwise. This process
provides a set of links similar to the ones used in the
Classical Approach, with the exception that now the links
have been determined randomly instead of through a
decision process comparable to (2.3). Note that since
$E(\tilde{l}_{j,ik}) = \tilde{\theta}_{j,ik}$, the sum of the adjusted linkage weights $\tilde{\theta}_{j,ik}$
corresponds to the expected total number of links $\tilde{L}$ from
the Bernoulli process in $A \times B$, i.e.,

$$E(\tilde{L}) = \sum_{j=1}^{M^A} \sum_{i=1}^{N} \sum_{k=1}^{M_i^B} \tilde{\theta}_{j,ik} \qquad (4.8)$$

For each unit $j$ selected in $s^A$, we identify the units $ik$ of $U^B$
that have $\tilde{l}_{j,ik} = 1$. Let $\tilde{\Omega}^B$ be the set of the $\tilde{n}$ clusters
identified by the units $j \in s^A$. Note that $\tilde{n} \le n^{RL}$.
Unfortunately, in contrast to $n^{RL}$ and $n^{RLT}$, the random number of
clusters $\tilde{n}$ is hardly comparable to $n$.

The initial weight $\tilde{w}'_{ik}$ is defined as follows:

$$\tilde{w}'_{ik} = \sum_{j=1}^{M^A} \tilde{l}_{j,ik}\, \frac{t_j}{\pi_j^A} \qquad (4.9)$$

The final weight $\tilde{w}_i$ is given by

$$\tilde{w}_i = \frac{\sum_{k=1}^{M_i^B} \tilde{w}'_{ik}}{\sum_{k=1}^{M_i^B} \tilde{L}^B_{ik}} \qquad (4.10)$$

where $\tilde{L}^B_{ik} = \sum_{j=1}^{M^A} \tilde{l}_{j,ik}$. The quantity $\tilde{L}^B_{ik}$ represents the
realised number of links between the units of $U^A$ and the
unit $k$ of cluster $i$ of population $U^B$. Finally, we assign
$\tilde{w}_{ik} = \tilde{w}_i$ for all $k \in U_i^B$.
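
The random selection of the links can be sketched as follows. The function names, the seed and the toy adjusted weights are illustrative assumptions; once the links are drawn, the weights are obtained exactly as in the Classical Approach, with the drawn indicators in place of the links given by the decision rule.

```python
import random


def draw_links(theta_adj, rng):
    """One Bernoulli trial per pair, accepted with probability theta_adj.

    rng.random() is uniform on [0, 1), so "u < p" is true with probability p
    (the strict inequality guarantees that a weight of 0 is never accepted).
    """
    return [[1 if rng.random() < p else 0 for p in row] for row in theta_adj]


def expected_number_of_links(theta_adj):
    """Sum of the adjusted linkage weights, as in (4.8)."""
    return sum(p for row in theta_adj for p in row)


rng = random.Random(2001)                       # arbitrary seed
theta_adj = [[1.0, 0.0, 0.3],                   # hypothetical adjusted weights
             [0.0, 1.0, 0.3]]
links = draw_links(theta_adj, rng)
expected = expected_number_of_links(theta_adj)

# Empirical acceptance rate of a pair with adjusted weight 0.3.
trials = 20000
rate = sum(draw_links([[0.3]], rng)[0][0] for _ in range(trials)) / trials
print(links, expected, rate)
```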

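To estimate the total $Y^B$, we can use the estimator

$$\hat{\tilde{Y}} = \sum_{i=1}^{\tilde{n}} \sum_{k=1}^{M_i^B} \tilde{w}_{ik}\, y_{ik} \qquad (4.11)$$

By conditioning on the accepted links $\tilde{l}$, it can be shown
that estimator (4.11) is conditionally design unbiased and
hence, unconditionally design unbiased. Note that by
conditioning on $\tilde{l}$, the estimator (4.11) is then equivalent to
(3.1). To get the variance of $\hat{\tilde{Y}}$, again conditional
arguments need to be used. Letting the subscript 1 indicate that
the expectation is taken over all possible sets of links, and
the subscript 2 that it is taken over the sampling design
conditionally on the accepted links, we have

$$\mathrm{Var}(\hat{\tilde{Y}}) = E_1 \mathrm{Var}_2(\hat{\tilde{Y}}) + \mathrm{Var}_1 E_2(\hat{\tilde{Y}}). \qquad (4.12)$$

First, from conditional unbiasedness, we have

$$E_2(\hat{\tilde{Y}}) = Y^B. \qquad (4.13)$$

Therefore,

$$\mathrm{Var}_1 E_2(\hat{\tilde{Y}}) = 0. \qquad (4.14)$$

Second, from (3.5), we directly have

$$\mathrm{Var}_2(\hat{\tilde{Y}}) = \sum_{j=1}^{M^A} \sum_{j'=1}^{M^A} \frac{\pi^A_{jj'} - \pi^A_j \pi^A_{j'}}{\pi^A_j \pi^A_{j'}}\, \tilde{Z}_j \tilde{Z}_{j'} \qquad (4.15)$$

where $\tilde{Z}_j$ is defined as in (3.4) but with the links $l$ replaced
by $\tilde{l}$. Hence, the variance of $\hat{\tilde{Y}}$ can be expressed as

$$\mathrm{Var}(\hat{\tilde{Y}}) = E_1 \left( \sum_{j=1}^{M^A} \sum_{j'=1}^{M^A} \frac{\pi^A_{jj'} - \pi^A_j \pi^A_{j'}}{\pi^A_j \pi^A_{j'}}\, \tilde{Z}_j \tilde{Z}_{j'} \right) \qquad (4.16)$$

where the expectation is taken over all possible sets of
links.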

With the GWSM, we stated in section 3 a constraint that
must be satisfied for unbiasedness of the GWSM. In the
present approach, by randomly selecting the links, it is very
likely that this constraint will not be satisfied. To solve this
problem, we can impute a link by choosing the one with the
highest non-zero linkage weight $\theta_{j,ik}$ within the cluster. If
there is still no link because all $\theta_{j,ik} = 0$, it is possible to
choose one link at random within the cluster. It should be
noted that this solution preserves the design unbiasedness
of the GWSM.


4.4 Some Remarks 


The three proposed approaches do not use the decision
rule (2.3). They also do not make use of any manual resolution.
Hence, the answer to question (c) of the introduction is
yes. That is, the GWSM can help in reducing the manual
resolution required by record linkage. Note that there is
however a price to pay for avoiding manual resolution.

First, with Approach 1, the number $n^{RL}$ of clusters
identified by the units $j \in s^A$ is greater than or equal to the
number $n$ of clusters identified by the Classical Approach,
i.e., when the decision rule (2.3) is used to identify the
links. This is because we use all non-zero links, and not just
the ones satisfying the decision rule (2.3). As a
consequence, the collection costs with Approach 1 will be greater
than or equal to the ones related to the use of the Classical
Approach. It then needs to be checked which costs are the
most important: the collection costs or the costs of manual
resolution. Note that if the precision resulting from the use
of Approach 1 is much higher than the one from the Classical
Approach, it might be of more interest to use the former
than the latter.

With Approach 2, we have $n^{RLT} \le n^{RL}$ and therefore the
collection costs of this approach are less than or equal to the
ones of Approach 1. If the precision of Approach 2 is
comparable to the one of Approach 1, then the former will
certainly be more advantageous than the latter. By
comparing Approach 2 with the Classical Approach, it can
be seen that the collection costs can be almost equivalent if
the value of the threshold $\theta_{High}$ is chosen to be close to the
lower and upper thresholds of the decision rule (2.3). Note
that Approach 2 is not using any manual resolution. If the
precision of Approach 2 is at least comparable to the one of
the Classical Approach, then Approach 2 will have a clear
advantage. Note also that if $\theta_{High} = \theta_{Low}$, the two approaches
differ only in the definition of the estimation weights
obtained by the GWSM. Approach 2 uses the linkage
weights $\theta$, while the Classical Approach uses the indicator
variables $l$. After setting $\theta_{High} = \theta_{Low}$, it is certainly of
interest to verify which approach has the highest precision.

With Approach 3, the number of selected links will be
less than or equal to the number of non-zero links used by
Approach 1, i.e., $\tilde{n} \le n^{RL}$. Hence, the collection costs of
Approach 3 will be less than or equal to the ones of
Approach 1. In terms of precision, it is not clear which
variance is likely to be the smallest between the two
approaches. As mentioned before, in contrast to $n^{RL}$ and
$n^{RLT}$, the random number of clusters $\tilde{n}$ is hardly
comparable to $n$. The two depend on different parameters: the
Classical Approach depends on the thresholds $\theta_{Low}$ and
$\theta_{High}$, while Approach 3 depends on the adjusted linkage
weights $\tilde{\theta}_{j,ik}$ that correspond to the selection probabilities
of the links.


5. SIMULATION STUDY 


A simulation study was performed to evaluate the 
proposed approaches against the Classical Approach where 
the decision rule (2.3) is used to determine the links. This 
study was made by comparing the precision obtained for the 
estimation of a total $Y^B$ using five different approaches:


Approach 1: use all non-zero links with their 
respective linkage weights 


Approach 2: use all non-zero links above a threshold 


Approach 3: choose the links randomly using 
Bernoulli trials 


Approach 4: Classical Approach 


Approach 5: use all non-zero links, but with the 
indicator variable / 


Approach 5 is a mixture of Approach 1 and the Classical
Approach. It consists of first accepting as links all pairs
$(j, ik)$ with a non-zero linkage weight, i.e., assigning $l_{j,ik} = 1$
for all pairs $(j, ik)$ where $\theta_{j,ik} > 0$, and 0 otherwise. The
GWSM described in section 3 is then used to produce the
estimate of $Y^B$. Approach 5 was added to the simulations to
see the effect of using the indicator variable $l$ instead of the
linkage weight $\theta$ when using all non-zero links. As for the
other approaches, Approach 5 can be shown to be unbiased.

Given that all five approaches yield design unbiased
estimates of the total $Y^B$, the quantity of interest for
comparing the various approaches was the standard error of
the estimate, or simply the coefficient of variation (i.e., the
ratio of the square root of the variance to the expected
value).


The simulation study was performed based on the 
agriculture example mentioned throughout the paper. This 
example corresponds in fact to a real situation occurring at 
Statistics Canada related to the construction of the Whole 
Farm Data Base (see Statistics Canada 2000). Note that 
although the simulation study was based on a real situation, 
some of the numbers used have been changed for 
confidentiality reasons. Also, the linkage process did not 
reflect the exact procedure used within Statistics Canada. 
For more information on the exact procedure, see Lim 
(2000). It was felt that these changes do not negate the 
results of the simulation study. The main purpose of the 
simulations was to evaluate the proposed approaches 
against the Classical Approach. It was not intended to solve 
the problems related to the construction of the Whole Farm 
Data Base, which could be considered as a secondary goal. 

Recall that the agriculture example is one of an 
agricultural survey where the first population $U^A$ is a list of
farms as determined by the Canadian Census of 
Agriculture. This list is from the 1996 Farm Register, which 
is essentially a list of all records collected during the 1991 
Census of Agriculture with all the updates that have 
occurred since 1991. It contains a farm operator identifier 
together with some socio-demographic variables related to 
the farm operators. The second population $U^B$ is a list of
taxation records from the CCRA. This second list is the 
1996 Unincorporated CCRA Tax File that contains data on 
tax filers declaring at least one farming income. It contains 
a household identifier (only on a sample basis), a tax filer 
identifier, and also socio-demographic variables related to 
the tax filers. 

At Statistics Canada, Agriculture Division produces 
estimates on crops and livestock using samples selected
from the Farm Register (population $U^A$). To create the
Whole Farm Data Base, it is of interest to collect tax data 
for the farms that have been selected in the samples from 
the Farm Register. This is done by first merging the Farm 
Register with the Unincorporated CCRA Tax File 
(population $U^B$) and then obtaining the tax data from
CCRA. As mentioned before, it turns out that the 
relationship between the farm operators of the Farm 
Register and the tax filers from the Unincorporated CCRA 
Tax File is not one-to-one. This is why the GWSM turns out 
to be a useful approach for producing estimation weights 
for the tax filers selected through the sample of farm 
operators from the Farm Register.

Some might argue that there is no need to obtain a set of
clusters identified by the units $j \in s^A$, since the target
population $U^B$ is one of tax filers from the Unincorporated
CCRA Tax File, which is usually available on a census
basis. Note however that this is not totally true. Not all
variables of interest are available on this file and Statistics
Canada needs to pay for the extra variables requested from
CCRA. Also, the data from the Unincorporated CCRA Tax
File are not free of errors due to keying, coding, etc., and
therefore there are some costs related to cleaning up the
data. For these reasons, it is found preferable to restrict the
data from the target population $U^B$ to a subset only. Since
this needs to be done, one way of identifying the set of
clusters to be used in the estimate of $Y^B$ is simply to do it
through the sample $s^A$ selected from $U^A$.

Apart from the Classical Approach, all approaches
consider the linkage itself between $U^A$ and $U^B$ as a
secondary goal, the first one being to produce an estimate of $Y^B$
for the target population $U^B$. However, the application
mentioned here is one related to the Whole Farm Data Base,
which aims to be an integrated data base. Not having a
linkage of good quality between the populations $U^A$ and
$U^B$ would lead to erroneous microdata analyses between
the crops and livestock variables measured in the sample $s^A$
and the tax data obtained from $U^B$. On this aspect, the
authors agree that the proposed approaches, with the
exception of the Classical Approach, are not viable in the
present context. This is true, however, from a long-term point of
view. Because manual resolution is needed when using a
decision rule such as (2.3), one could suggest using the
proposed approaches to produce some of the required
estimates from $U^B$ in the short term, before the final linkage is
available after manual resolution. Recall that the main
purpose of the simulations is to evaluate the proposed
approaches against the Classical Approach. The agriculture
example has not been chosen because it corresponds to a
real situation, but more because of the availability of the
data. It could have been any other example, such as the other
one mentioned in the introduction where $U^A$ is a population
of parents and $U^B$ a population of children belonging to the
parents.

For the purpose of the simulations, two provinces of
Canada were considered: New Brunswick and Québec. The
former can be considered as a small province and the latter
a large one. Table 1 provides the size of the different files.
Because the household identifier is not available for the
entire population $U^B$, for the purpose of the simulations, it
has been constructed based on a sample. This sample has
the household identifier coded for each tax filer. For the
non-sample tax filers, the household identifiers were
randomly assigned such that the household sizes correspond
to the same proportions of household sizes found in the
sample.


Table 1
Agriculture Example

                                              Québec    New Brunswick
Size of Farm Register ($U^A$)                  43017             4930
Size of Tax File ($U^B$)                       52394             5155
Total number of households of $U^B$            22387             2194
Total number of non-zero linkage weights      105113            13787


The linkage process used for the simulations was a
match using five variables. It was performed using the
MERGE statement in SAS®. All records on both files were
compared to one another in order to see if a potential match
had occurred. The record linkage was performed using the
following five key variables common to both sources:

- first name (modified using NYSIIS)
- last name (modified using NYSIIS)
- birth date
- street address
- postal code

The first name and last name variables were modified
using the NYSIIS system. This basically changes the name
into phonetic expressions, which in turn increases the chance
of finding matches by reducing the probability that a good
match is rejected because of a spelling mistake. For more
details about NYSIIS, see Lynch and Arends (1977).

Records that matched on all 5 variables received the
highest linkage weight ($\theta = 60$). Records that matched on
only a subset of at least 2 of the 5 variables received a lower
linkage weight (as low as $\theta = 2$). It should be noted that the
levels of the linkage weights were chosen arbitrarily. As
mentioned before, it is not really the levels themselves that
are important, but rather the relative importance of the
linkage weights between each other.

Records that did not match on any combination of key
variables were not considered as potential links, which is
equivalent to having a linkage weight of zero. Two
different thresholds were used for the simulations:
$\theta_{High} = \theta_{Low} = 15$ and $\theta_{High} = \theta_{Low} = 30$. The upper and
lower thresholds, $\theta_{High}$ and $\theta_{Low}$, were set to be the same
to avoid the grey area where some manual intervention is
needed when applying the decision rule (2.3).

Note that the constraint related to the use of the GWSM
needed to be satisfied. When for a cluster $i$ of $U^B$ there was
no non-zero linkage weight $\theta_{j,ik}$ between any units $k$ of this
cluster and the units from $U^A$, we imputed a link by
choosing the link with the largest linkage weight $\theta_{j,ik}$
within the cluster. Note that it also happened that for some
units $j$ of $U^A$, there was no non-zero linkage weight $\theta_{j,ik}$
with any unit $ik$ of $U^B$; this was not considered a problem
since the only coverage in which we are interested is the
one of $U^B$. Table 1 provides the total number of non-zero
links found in each of the two provinces.

For the simulations, we selected the sample from
$U^A$ (i.e., the Farm Register) using Simple Random
Sampling Without Replacement (SRSWOR), without any
stratification. We also considered two sampling fractions:
30% and 70%. The quantity of interest $Y^B$ to be estimated
was the Total Farming Income.

Since we have the whole population of farms and
taxation records, it was possible for us to calculate the
theoretical variance for these estimates. It was also possible
to estimate this variance by selecting a large number of
samples (i.e., performing a Monte-Carlo study), estimating
the parameter $Y^B$ for each sample, and then calculating the
variance of all the estimates. Both approaches were used.
For the simulations, 500 simple random samples were
selected for each approach for the two different sampling
fractions (30% and 70%). The two thresholds (15 and 30)
were also used to better understand the properties of the
given estimators.

Because we assumed SRSWOR, the theoretical formulas
given in section 4 could be simplified. For example, under
SRSWOR, the variance formula (4.5) reduced to the
following:

$$\mathrm{Var}(\hat{Y}^B) = M^A \, \frac{(1 - f)}{f} \, S_Z^2 \qquad (5.1)$$

where $f = m^A / M^A$ is the sampling fraction,
$S_Z^2 = \frac{1}{M^A - 1} \sum_{j=1}^{M^A} (Z_j - \bar{Z})^2$
and $\bar{Z} = \frac{1}{M^A} \sum_{j=1}^{M^A} Z_j$.
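
As a small check of the SRSWOR simplification, the sketch below compares the closed-form variance of the expansion estimator with the variance obtained by listing every possible sample from a toy population. The values in `z` are made up for illustration and stand in for the derived variables $Z_j$; the function names are hypothetical.

```python
from itertools import combinations


def srswor_variance(z, m):
    """Closed-form variance of (M/m) * sum(sample) under SRSWOR."""
    M = len(z)
    f = m / M
    z_bar = sum(z) / M
    s2 = sum((v - z_bar) ** 2 for v in z) / (M - 1)
    return M * (1 - f) / f * s2


def enumerated_variance(z, m):
    """Exact variance obtained by enumerating all samples of size m."""
    M = len(z)
    estimates = [M / m * sum(s) for s in combinations(z, m)]
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


z = [3.0, 7.0, 1.0, 12.0, 5.0, 8.0]  # illustrative values only
formula = srswor_variance(z, 3)
exact = enumerated_variance(z, 3)
```

Enumeration is only practical for very small populations, which is why the study itself relied on the formulas and on Monte-Carlo replication.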

The Monte-Carlo study involved 500 replicates. For each
of the two sampling fractions (30% and 70%), 500 simple
random samples t were selected, and the expectation and
variance for each of the five approaches were then
estimated using

$$\hat{E}(\hat{Y}) = \frac{1}{500} \sum_{t=1}^{500} \hat{Y}_t \qquad (5.2)$$

and

$$\hat{V}(\hat{Y}) = \frac{1}{500} \sum_{t=1}^{500} \left( \hat{Y}_t - \hat{E}(\hat{Y}) \right)^2. \qquad (5.3)$$


The estimated coefficients of variation (CVs) were 
obtained by using 


(5.4) 


The Monte-Carlo process was performed to verify 
empirically the exactness of the theoretical formulas 
provided in section 4. The results indicate that all the 
theoretical formulas provided were exact. 
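
A minimal sketch of such a Monte-Carlo check is given below, using a plain expansion estimator under SRSWOR rather than the GWSM estimators of the study. The function name, the seed, and the toy population are assumptions, and the CV is taken as the square root of the estimated variance divided by the estimated expectation, which is an assumed form since the display of (5.4) does not survive here.

```python
import random
from math import sqrt


def monte_carlo(z, m, replicates=500, seed=0):
    """Estimate expectation, variance and CV of the expansion estimator."""
    rng = random.Random(seed)
    M = len(z)
    # One estimate of the total per replicate, each from a fresh SRSWOR sample.
    estimates = [M / m * sum(rng.sample(z, m)) for _ in range(replicates)]
    e_hat = sum(estimates) / replicates
    v_hat = sum((y - e_hat) ** 2 for y in estimates) / replicates
    cv_hat = sqrt(v_hat) / e_hat  # assumed form of the CV
    return e_hat, v_hat, cv_hat


population = [float(v) for v in range(1, 21)]  # toy values, total 210
e_hat, v_hat, cv_hat = monte_carlo(population, 6, replicates=2000, seed=1)
```

For this toy population the theoretical variance is 20 x (0.7 / 0.3) x 35, about 1633, so `v_hat` can be compared against it in the same way the study compared simulated and theoretical variances.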

The results of the study are presented in Figures 2.1 to
2.4, Table 2, and Figure 3. Figures 2.1 to 2.4 provide bar
charts of the CVs obtained for each of the five approaches.
The bar charts are given for the eight cases obtained by
crossing the two provinces Québec and New Brunswick,
the two sampling fractions 30% and 70%, and the two
thresholds 15 and 30. On each bar of the charts, one can
find the number of non-zero links between $U^A$ and $U^B$ for
each of the five approaches. Note that for Approach 3, this
is in fact the expected number of non-zero links. The
number of (expected) non-zero links does not change from
one sampling fraction to another. Table 2 shows the average
number of clusters interviewed by approach, for each of the
eight cases, where the average is taken over the 500 samples
used for the simulations. The numbers in parentheses are the
standard deviations. They are relatively small compared to
the averages and therefore the number of clusters identified
through the sample $s^A$ does not fluctuate greatly from one
sample $s^A$ to another. Figure 3 provides scatter plots of
the obtained CVs by the average number of clusters
identified through the sample $s^A$, for each of the eight
cases.

By looking at Figures 2.1 to 2.4, it can be seen that in
all cases, Approach 1 and Approach 5 provided the smallest
CVs for the estimation of the Total Farming Income.
Therefore, using all non-zero links yields the greatest
precision. Note however that by looking at Table 2, we can
see that these approaches also lead to the highest number of
clusters identified through the sample selected from $U^A$.
In fact, we can see that the greater the number of clusters
used in the estimation, the greater the precision of the
resulting estimates. This result is shown in Figure 3 where
we can see that the CVs tend to decrease as the average
number of clusters identified through $s^A$ increases.
Although this result is well known in classical sampling
theory, it was not guaranteed to hold in the context of the
GWSM. As we can see from equation (3.5), it is not the
sample size of $s^A$ that increases, but rather the
homogeneity of the derived variables $Z_j$.

Now, by comparing Approach 1 and Approach 5, it can
be seen that the latter always provided the smallest
variance. This suggests using the indicator variable $l$
instead of the linkage weight $\theta$ when using all
non-zero links. Note that it seems this can be generalised
since the same phenomenon occurred with Approach 2 and
Approach 4 (Classical Approach). Recall that, because
$\theta_{High} = \theta_{Low}$, the two approaches differ only
in the definition of the estimation weights obtained by the
GWSM. Approach 2 uses the linkage weights $\theta$, while the
Classical Approach uses the indicator variables $l$. Note
that this result agrees with the conclusions of Kalton and
Brick (1995), since the optimal choice of letting the
constants $\alpha$ be 0 for some units and a positive value
that is equal for all the remaining units within the cluster
corresponds to the use of the indicator variable $l$.

We now concentrate on Approach 3. For seven out of the
eight bar charts of Figures 2.1 to 2.4, Approach 3
produced the highest CVs. The only lower CV was
obtained for Québec, with the sampling fraction of 30% and
the threshold $\theta_{High} = 30$. It should however be noted
that this approach is the one that used the lowest number of
non-zero links, and also the lowest average number of
clusters identified through $s^A$. Therefore, this result is
not totally surprising. Recall that the number of non-zero
links used by Approach 3 does not depend on the threshold
$\theta_{High}$, and thus the CVs obtained for Québec with
$f = 0.3$ were equal for $\theta_{High} = 15$ and
$\theta_{High} = 30$. For $\theta_{High} = 15$, the CV
obtained for Approach 3 for Québec was higher than the
ones for Approaches 2 and 4, and these two were using
more non-zero links, and more clusters. For
$\theta_{High} = 30$, the CV obtained for Approach 3 was
lower than the ones from Approaches 2 and 4, but these two
were still using more non-zero links, and more clusters.
Therefore, there are intermediate situations where, with
$15 < \theta_{High} < 30$, we should get equal CVs for
Approaches 3 and 2, and Approaches 3 and 4. As a
consequence, to get equal CVs between Approach 3 and each
of Approaches 2 and 4, more non-zero links and more
clusters must be used by the latter. This suggests that in
some cases, Approach 3 might be more appropriate to use
than Approaches 2 and 4 because estimates with the same
precision can be obtained with lower collection costs.

In order to better compare Approach 3 to Approaches
2 and 4, we forced the number of expected non-zero links
to be the same as the number of non-zero links used by
Approaches 2 and 4. For this, we transformed the
linkage weights $\theta_{j,ik}$ to $\theta^{*}_{j,ik}$ in order
to have

$$\sum_{j} \sum_{i} \sum_{k} \theta^{*}_{j,ik} = L_0 \qquad (5.5)$$

where $L_0$ is the desired number of non-zero links. The
transformation used was

$$\theta^{*}_{j,ik} = \begin{cases} \theta_{j,ik} / \theta_0 & \text{if } \theta_{j,ik} / \theta_0 \le 1 \\ 1 & \text{otherwise} \end{cases} \qquad (5.6)$$

where $\theta_0$ was determined iteratively such that (5.5)
is satisfied. The use of Approach 3 with the transformation
(5.6) is referred to as Approach 6. The results of the
simulations are presented in Figures 4.1 to 4.4. As we can
see, Approach 6 turned out to have the smallest CVs for
half of the cases. For the other cases, Approach 4 yielded
the best precision. Note that this pattern was not specific
to a particular province, sampling fraction, or threshold.
It would therefore be difficult in practice to determine in
advance which of Approach 6 or Approach 4 would produce the
smallest CVs. Because of this, and because Approach 6 (and
Approach 3) can produce large linkage errors, Approach 4
should be preferred.
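
The text says only that the scaling constant was determined iteratively. One way this could be done is by bisection, sketched below under the assumption that the transformation divides each weight by the constant and caps the result at 1, so that the transformed weights sum to the desired number of links. The function names and the example weights are hypothetical.

```python
def transformed_weights(thetas, theta0):
    """Scale each linkage weight by theta0 and cap the result at 1."""
    return [min(t / theta0, 1.0) for t in thetas]


def solve_theta0(thetas, target_links, tol=1e-10, max_iter=200):
    """Find theta0 such that the transformed weights sum to target_links."""
    positive = [t for t in thetas if t > 0]
    if not 0 < target_links <= len(positive):
        raise ValueError("target must lie in (0, number of non-zero weights]")
    # The sum of transformed weights never increases as theta0 grows,
    # so a bracketing interval plus bisection is enough.
    lo = min(positive)                 # every weight capped: sum = len(positive)
    hi = sum(positive) / target_links  # uncapped sum equals the target here
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        total = sum(transformed_weights(positive, mid))
        if abs(total - target_links) < tol:
            return mid
        if total > target_links:
            lo = mid   # too many expected links: scale down harder
        else:
            hi = mid
    return (lo + hi) / 2


weights = [60.0, 30.0, 15.0, 6.0, 2.0]  # illustrative linkage weights
theta0 = solve_theta0(weights, 3)
expected_links = sum(transformed_weights(weights, theta0))
```

In this toy case the two largest weights are capped at 1 and the remaining three sum to 23, so the solution is `theta0 = 23`.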
Table 2
Average Number of Identified Clusters

                       Average number of identified clusters (s.e.)
                       Québec                     New Brunswick
Threshold  Approach    f = 0.3      f = 0.7      f = 0.3     f = 0.7
15         1           15752 (58)   21106 (30)   1709 (18)   2100 (7)
           2           14281 (49)   20593 (34)   1310 (17)   1966 (13)
           3           10930 (50)   18881 (47)   1123 (14)   1869 (14)
           4           14281 (49)   20593 (34)   1310 (17)   1966 (13)
           5           15752 (58)   21106 (30)   1709 (18)   2100 (7)
30         1           15752 (58)   21106 (30)   1709 (18)   2100 (7)
           2           11310 (45)   19139 (37)   1215 (17)   1924 (15)
           3           10930 (50)   18881 (47)   1123 (14)   1869 (14)
           4           11310 (45)   19139 (37)   1215 (17)   1924 (15)
           5           15752 (58)   21106 (30)   1709 (18)   2100 (7)
Figure 3. Graphs of CVs versus Average Number of Identified Clusters

Figure 4.3. CVs for Québec (with $\theta_{High} = \theta_{Low} = 15$).

6. CONCLUSION


In the present paper, we have seen that the GWSM is
adaptable to populations linked through record linkage.
This is in fact a natural extension of the case where
the links are either present or absent, which corresponds to
the use of an indicator variable $l_{j,ik} = 1$ if the pair
(j, ik) is considered to be a link, 0 otherwise. When two
populations are linked through record linkage, there is
always some uncertainty left because the decisions on the
links are made using a probabilistic approach. Therefore,
replacing the indicator variable $l_{j,ik}$ by the linkage
weight $\theta_{j,ik}$ that has been computed for each pair
(j, ik) simply makes the GWSM more general.
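
The substitution described above can be illustrated with a short sketch of a weight-share computation. It assumes the usual form of the method (for each cluster, the sum of link values to sampled units divided by their selection probabilities, over the sum of all link values into the cluster); the formulas in the earlier sections of the paper remain the authoritative statement. The data layout, names, and toy values are hypothetical.

```python
def gwsm_weights(link, pi, sampled, clusters):
    """Weight-share estimation weight for each cluster of U^B.

    link[j][u] : link value between unit j of U^A and unit u of U^B
                 (a 0/1 indicator, or a linkage weight theta)
    pi[j]      : selection probability of unit j of U^A
    sampled    : set of sampled units j (the sample s^A)
    clusters   : dict mapping cluster id -> list of units u in that cluster
    """
    weights = {}
    for cid, units in clusters.items():
        # Share of the sampling weights reaching the cluster through links.
        num = sum(link[j].get(u, 0.0) / pi[j] for j in sampled for u in units)
        # All link values into the cluster, sampled or not.
        den = sum(row.get(u, 0.0) for row in link.values() for u in units)
        if den == 0:
            raise ValueError(f"cluster {cid} has no link to U^A")
        weights[cid] = num / den
    return weights


link = {"a": {"u1": 60.0, "u2": 2.0}, "b": {"u2": 30.0, "u3": 15.0}}
clusters = {1: ["u1", "u2"], 2: ["u3"]}
census = gwsm_weights(link, {"a": 1.0, "b": 1.0}, {"a", "b"}, clusters)
```

Passing 0/1 values in `link` gives the indicator-based weights, and passing linkage weights gives the generalised version; in both cases a census of the first population yields weights of 1 for every cluster.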

Some simulations were performed using the 1996 Farm
Register (population $U^A$) and the 1996 Unincorporated
CCRA Tax File (population $U^B$). We compared the
variances obtained for each of the five approaches: (1) use
all non-zero links; (2) use all non-zero links above a
threshold; (3) choose links randomly using Bernoulli trials;
(4) Classical Approach; (5) use all non-zero links, but with
the indicator variable $l$. All results showed that
Approach 1 and Approach 5 provide the smallest CVs for the
estimation of the Total Farming Income. However, these two
approaches use the highest number of links, and also the
highest number of clusters identified through $s^A$, which
implies the highest collection costs. Because of this,
Approaches 2, 3 and 4 might be viewed as good compromises.

For a given threshold $\theta_{High}$, it is preferable to
use the indicator variable $l$ instead of the linkage
weights $\theta$ in the construction of the estimation
weights with the GWSM. This result holds even for
$\theta_{High} = 0$ (i.e., no threshold is used), as for
Approaches 1 and 5. The estimates produced with the
indicator variable $l$ always had the smallest CVs, and this
result agrees with the conclusions of Kalton and Brick
(1995). Hence, Approach 5 should be preferred to
Approach 1, and Approach 4 should be preferred to
Approach 2.

The threshold $\theta_{High}$ is useful to reduce the
number of non-zero links to be manipulated. By reducing
the number of non-zero links, we also reduce the number
of clusters identified through the sample $s^A$, and hence
we reduce the collection costs associated with the
measurement of the variable of interest y within the
clusters. Note that by reducing the number of links, we
decrease the precision of the estimates produced.
Therefore, a trade-off needs to be made between the desired
precision and the collection costs.

The reduction of the number of non-zero links can also
be achieved by using the decision rule (2.3) with the two
thresholds $\theta_{Low}$ and $\theta_{High}$. This decreases
the collection costs, but introduces the need for some
manual resolution when the linkage weights $\theta$ are
between $\theta_{Low}$ and $\theta_{High}$. The manual
resolution leads however to better links, i.e., with fewer
linkage errors. If manual resolution is used only to make
the links one-to-one between population $U^A$ and
population $U^B$, then it might not be necessary since the
GWSM is particularly appropriate to handle estimation in
situations where the links between $U^A$ and $U^B$ are
complex.

When compared to Approaches 2 and 4, Approach 3
turned out to be preferable in some cases. Because it would
be difficult in practice to determine in advance which of
Approach 3 or Approach 4 would produce the smallest
CVs, and because Approach 3 can produce large linkage
errors, Approach 4 should be preferred. Hence, the
Classical Approach of using the GWSM with the indicator
variable $l$ with links determined using a decision rule
such as (2.3) seems the most appropriate approach to
estimate the total $Y^B$ using a sample selected from $U^A$.

Cross-sectional Estimation in Multiple-Panel Household Surveys


TAKIS MERKOURIS


ABSTRACT 


This paper presents weighting procedures that combine information from multiple panels of a repeated panel household 
survey for cross-sectional estimation. The non-static nature of a repeated panel survey is discussed in relation to estimation
of population parameters at any wave of the survey. A repeated panel survey with overlapping panels is described as a 
special type of multiple frame survey, with the frames of the panels forming a time sequence. The paper proposes weighting 
strategies suitable for various multiple-panel survey situations. The proposed weighting schemes involve an adjustment of 
weights in domains of the combined panel sample that represent identical time periods covered by the individual panels. 
A weight adjustment procedure that deals with changes in the panels over time is discussed. The integration of the various 
weight adjustments required for cross-sectional estimation in a repeated panel household survey is also discussed. 


KEY WORDS: Repeated panel surveys; Multiple frames; Temporal domains; Combined panels; Cross-sectional weighting; Weight share method.


1. INTRODUCTION 


A panel survey collects the survey data for the same 
sample elements at different time points (the survey waves). 
A repeated panel survey is made up of a series of panel 
surveys, each having fixed duration, with the panels 
selected at different time points. In a repeated panel house- 
hold survey a sample of households is selected for each 
panel from the population of households existing at the start 
of the panel. Depending on the objectives of the panel 
survey, one or all individuals in the sampled households 
become panel members to be followed throughout the 
duration of the panel or until they leave the survey popu- 
lation. At a subsequent survey wave the household sample 
consists of all the households in which panel members 
reside. A review of various types of panel surveys is given 
in Kalton and Citro (1993). A formalization of related 
concepts can be found in Deville (1998). 

The type of repeated panel household survey considered 
in this paper consists of two or more panels covering over- 
lapping time periods. A typical example of such a survey is 
the Canadian Survey of Labour and Income Dynamics 
(SLID), which employs two overlapping panels of duration 
of six years each; for a description of the SLID see Lavigne 
and Michaud (1998). In the SLID, each new panel is intro- 
duced three years after the introduction of the previous one. 
The sample for each panel is made up of two rotation 
groups from the Canadian Labour Force Survey, which uses 
a stratified multistage design with an area frame wherein 
dwellings containing households are the final sampling 
units. 

A panel survey, though primarily conducted for longi- 
tudinal purposes, may also be used to produce cross- 
sectional estimates of population parameters for any survey 
wave. For cross-sectional purposes, data are usually
collected at each survey wave for all individuals living in
households that contain at least one selected member. The
process of obtaining cross-sectional estimates at any wave 
of a panel household survey after the first wave presents 
difficulties arising from the population and panel dynamics. 
Weighting schemes that deal with dynamic features of a 
single panel, such as movers and “cohabitants,” have been 
discussed in the literature; see Kalton and Brick (1995), and 
Lavallée (1995) for details. Yet, there seems to be a paucity 
of work in the literature on cross-sectional estimation for 
repeated panel household surveys with overlapping panels; 
some initial work in the context of the SLID can be found 
in Lavallée (1994). The cross-sectional estimation problem
in such multiple panel surveys is to combine the panels
properly, accounting for the changes in the population and
in the panels over time.

This paper describes procedures for cross-sectional esti- 
mation that combine information from overlapping panels 
of a repeated panel household survey. The coverage of the 
population by the individual panels at any given wave, and 
the use of the combined panels supplemented by a “top-up” 
sample to construct a representative cross-sectional sample 
are discussed in section 2. Also discussed in the same 
section are analogies with a multiple-frame survey scheme, 
as well as issues related to the sample dynamics. The 
weighting and estimation problem in repeated panel 
household surveys is described in section 3. Weighting 
strategies suitable for various panel survey situations are 
then proposed. Bias and efficiency issues related to the 
combination of panels are discussed. A weight adjustment 
procedure that deals effectively with changes in the 
combined panels over time is described in section 4. The 
integration of the various weight adjustments required for 
cross-sectional estimation in a repeated panel household 
survey is discussed in section 5. Finally, a summary and
concluding remarks are provided in section 6.

2. GENERAL CONSIDERATIONS


2.1 Coverage of the Cross-sectional Population 


Important to cross-sectional estimation are changes in 
the population composition over time, occurring when indi- 
viduals leave or enter the population. In a single-panel 
household survey, new entrants who have joined the survey 
population since the start of the panel are not represented in 
the sample at later waves if they live in households that do 
not contain any members of the original population. A 
multiple-panel household survey with overlapping panels 
provides a better coverage of the survey population than a 
single-panel survey, as it reduces the time period not 
covered by any of the panels. In the case of the SLID, this 
time period is reduced from a maximum of six years to a 
maximum of three years. Nevertheless, the problem of 
complete coverage remains unless a special supplementary 
sample of the non-covered population is taken at each 
survey wave. A survey scheme involving one panel and a 
supplementary sample drawn at each survey wave for cross- 
sectional purposes is described in Lavallée (1995). An 
alternative approach involves the selection, at each wave, of 
a new sample that covers the entire survey population but 
does not form a new panel. This sample (henceforth to be 
called top-up sample) is to be used only once, for cross- 
sectional purposes, and its size would normally be smaller 
than a panel’s size. In the context of constructing a cross- 
sectional sample, a top-up sample is discussed as a non- 
trivial case of supplementary sample, essentially treated as 
an additional small overlapping panel. 

The situation with regard to individuals who leave the
population is as follows. For any panel, the sampling frame
for the survey population at a time point t is essentially the
sampling frame for the population at the start of the panel,
with the leavers in the intervening period being treated as
blanks on the frame. Panel members who leave the
population before time t correspond to blanks on the frame,
and thus their effect on cross-sectional estimates at time t is
loss of efficiency but not bias; see also Kalton and Brick
(1995) for relevant discussion.

The foregoing observations lead to the following 
perspective regarding the coverage of the population by 
each of the panels at any wave of the survey. As regards 
cross-sectional representation, each panel covers at the time 
of its selection the entire survey population represented by 
the preceding panels. Accordingly, the frames of the panels 
form a time sequence, with the frame of each panel 
containing at the start of the panel the frames of the 
preceding panels. In such a sequence of frames, a common 
frame is formed sequentially as the intersection of the frame 
of a new panel with the remainder of the original common
frame of the preceding active panels. At any wave the 
common frame is the common frame at the start of the most 
recent of these panels, but without the leavers. The non- 
overlap frame domain at the start of a new panel consists of 


individuals who entered the population after the start of the 
preceding panel. Other frame domains (relatively very small 
in size) may be formed by returning units of older frames, 
in which case the time sequence of frames is not completely 
nested. Because of the latter type of frame domains, the 
complete frame at any wave after the selection of the most 
recent panel is the union of the frames of all panels at that 
time point, not just the remainder of the frame of the most 
recent panel. In panel surveys which employ a top-up 
sample at each wave the complete frame is that of the top- 
up sample. 


2.2 A Multiple Frame Analogy 


With the above considerations, a multiple panel survey 
with overlapping panels can be thought of as a special type 
of multiple frame survey, in which the frame for the cross- 
sectional population is the union of mutually exclusive 
temporal domains defined by the frames of the panels and 
their intersections. The sizes of the frames of the individual 
panels, as well as the characteristics of the population 
members in each panel’s frame, change over time. This is
in contrast with the static character of the usual type of 
multiple frame survey. Also, there is a high degree of 
nesting in the sequence of panel frames, so that the total 
number of mutually exclusive temporal frame domains is 
small. Among the various frame domains the one that is 
common to all panels is by far the largest. These special 
multiple frame features have implications in cross-sectional 
estimation, as will be discussed in the next section. 

The sample temporal domains may be even less static 
because of attrition, moves of selected individuals within 
and between panels and moves of non-selected individuals 
into households in which panel members reside. For 
instance, with the presence of new entrants (e.g., immi- 
grants) in households that contain selected individuals, a 
panel crosses the boundary of its frame into the frame of the 
succeeding panel. 

The analogy with multiple-frame survey sampling places 
the problem of cross-sectional estimation for repeated 
surveys with overlapping panels into a familiar framework. 
However, the distinctive dynamic features of multiple panel 
surveys will have to be considered if conventional multiple 
frame approaches are contemplated for the formulation of 
a cross-sectional estimation methodology. 

For the purpose of introducing a cross-sectional
estimation procedure that combines information from the
panels of a repeated panel household survey, it suffices to
consider the simple situation involving two overlapping
panels at the time point of the start of the second panel.
Note that this would always be the situation in a survey
with one panel and a top-up sample. Thus, adopting standard
multiple frame notation, with B and A denoting the frames of
the first and the second panel ($B \subset A$) at the start of
the second panel, and with $s_B$, $s_A$ denoting the
respective samples, the setting can be presented
schematically as in Figure 1.

Figure 1. Two overlapping panels at the start of the second panel.


In Figure 1, A is the complete frame, so that the second
panel at its start represents the cross-sectional population at
that time. The overlap domain B is the remainder of the
original frame of the first panel. The domain
$a = B^c \cap A$ consists of all new entrants into the
population since the start of the first panel. The samples
$s_A$ and $s_B$ are the originally selected ones, with $s_B$
reduced in size because of leavers and nonrespondents. It is
assumed that the samples $s_A$ and $s_B$ are drawn
independently from A and B according to specified
probability designs $p_A(s_A)$ and $p_B(s_B)$, which
determine the inclusion probabilities $\pi_{Ai}$ and
$\pi_{Bi}$ of the i-th unit (household or any individual
within it) for the original samples $s_A$ and $s_B$,
respectively. The samples $s_A$ and $s_B$ may intersect,
since members in the overlap frame B can be selected in both
panels. The issue of panel (sample) overlap is akin to that
of duplicate sample units in multiple frame surveys. In
repeated panel household surveys an operational constraint
motivated by respondent burden may be to exclude from
$s_A$ individuals already selected in $s_B$, thus inducing
$s_A \cap s_B = \emptyset$; for a discussion on this see
Lavallée (1994). Here, as in any multiple frame situation,
it is observed that if the probabilities $\pi_{Ai}$ and
$\pi_{Bi}$ are small the probability of duplicate units is
negligible. It will be assumed in the following that the
probabilities $\pi_{Ai}$ and $\pi_{Bi}$ are small, and in
effect $s_A \cap s_B = \emptyset$.
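
To give a sense of scale for the negligible-duplicates assumption, the sketch below computes the expected number of units selected in both panels when the two selections are independent, which is the sum over B of the product of the two inclusion probabilities. The frame size and probabilities are made-up illustrative values, and the function name is hypothetical.

```python
def expected_duplicates(pi_a, pi_b):
    """Expected number of units of B selected in both panels.

    Assumes the two panels are selected independently, so a unit with
    inclusion probabilities pa and pb is in both with probability pa * pb.
    """
    return sum(pa * pb for pa, pb in zip(pi_a, pi_b))


# Illustrative overlap frame: 10,000 units, each with probability 0.002
# of selection in either panel (about 20 sampled units per panel).
n_units = 10_000
dup = expected_duplicates([0.002] * n_units, [0.002] * n_units)
```

Here each panel is expected to contain about 20 units of B while the expected number of duplicates is 0.04, which is the sense in which the overlap can be ignored.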


3. CROSS-SECTIONAL WEIGHTING AND 
ESTIMATION 


This section describes procedures that combine infor- 
mation from multiple panels of a repeated panel household 
survey for cross-sectional estimation of population para- 
meters. The discussion is confined to estimation of totals. 
A uniform approach to cross-sectional estimation for house- 
holds and individuals is presented. This approach is based 
on the production of a set of weights for the combined panel 
sample that yield design-unbiased estimators of cross- 
sectional totals. Essentially, it involves the construction of 
a combined cross-sectional sample by means of an adjust- 
ment of the sampling weights of units from the temporal 
domains of the different panels that represent identical 
temporal domains of the cross-sectional population. While 
the delineation of the various temporal frame domains is
necessary for determining the coverage of parts of the
cross-sectional population by the different panels, the
identification of some of the corresponding sample domains
may not be possible under the operating procedures of a
repeated panel household survey. For example, the
information needed to determine whether or not a unit in
the second panel belongs to the non-overlap frame domain
a (see Figure 1) may not be available. In this section, both
cases of identifiable and non-identifiable temporal sample
domains are considered. The weight adjustment for the
combination of the panels involves only sampled units, and
takes no account of any changes (other than leavers) in
household membership between waves. A “weight share”
adjustment that handles such changes should follow the
combination of the panels, as it can be applied readily only
to the combined sample; see relevant discussion in section 4.


3.1 Identifiable Temporal Sample Domains 


Weighting options for the combination of the panels 


For the construction of a cross-sectionally representative
combined sample, a panel survey scheme such as that
depicted in Figure 1 is considered. In analogy with a
standard multiple frame argument (Bankier 1986; Skinner
and Rao 1996) the two samples $s_A$ and $s_B$ can be thought
of as selected independently from the complete frame A
according to the sampling designs $p_A(s_A)$ and $p_B(s_B)$,
but with a fixed time lag between the two selections. Then
the two sampling designs $p_A(s_A)$ and $p_B(s_B)$ induce a
well-defined design $p(s)$ on the set of samples
$s = s_A \cup s_B$ in A. Thus conventional estimators, based
on a single frame and a combined sample, may be constructed
from $p(s)$. The standard approach, leading to the
Horvitz-Thompson estimator, would be to assign sample units
weights made inversely proportional to their inclusion
probabilities. The probability of inclusion of the i-th
population unit in the combined sample,
$\pi_i = P(i \in s)$, is equal to
$\pi_{Ai} + \pi_{Bi} - \pi_{Ai}\pi_{Bi}$ if $i \in B$, and
equal to $\pi_{Ai}$ if $i \in a$. The weight of the i-th unit
of the sample is then $w_i = 1/\pi_i$. This weighting scheme
can be used provided that it is possible to identify the
common units in the samples $s_A$ and $s_B$, so that the
duplicate units can be eliminated. A simpler approach,
especially for surveys with more than two panels, would be
to assign any unit $i \in B$ a weight made inversely
proportional to the expected number of selections of the
unit, that is, inversely proportional to
$\pi_{Ai} + \pi_{Bi}$. This weighting scheme, proposed by
Kalton and Anderson (1986) for multiple frame surveys,
does not require identification of duplicate sample units.
Now, consider the sample domains $s_{ab} = s_A \cap B$ and
$s_a = s_A \cap a$ of $s_A$. Also, let a value $y_i$ be
associated with population unit i for some population
characteristic, and define the population total
$Y_A = \sum_A y_i \; (= \sum_a y_i + \sum_B y_i)$.
Then, employing the latter weighting scheme, the unbiased
estimator $\hat{Y}_A$ of the total $Y_A$ can be constructed.
On the assumption that the probabilities $\pi_{Ai}$ and
$\pi_{Bi}$ for $i \in B$ are small, the estimator
$\hat{Y}_A$ is approximately equal to the Horvitz-Thompson
estimator.
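
Written out from the description above, one form of the estimator consistent with the stated weights is sketched below. It is a reconstruction under assumptions, not a quotation: it takes the weight $1/\pi_{Ai}$ for sampled units of the non-overlap domain and $(\pi_{Ai} + \pi_{Bi})^{-1}$ for every selection of a unit of B, and it counts a unit selected in both panels once per selection.

```latex
\hat{Y}_A \;=\; \sum_{i \in s_a} \frac{y_i}{\pi_{Ai}}
\;+\; \sum_{i \in s_{ab}} \frac{y_i}{\pi_{Ai} + \pi_{Bi}}
\;+\; \sum_{i \in s_B} \frac{y_i}{\pi_{Ai} + \pi_{Bi}},
\qquad
E(\hat{Y}_A) \;=\; \sum_{a} y_i
\;+\; \sum_{B} \frac{\pi_{Ai} + \pi_{Bi}}{\pi_{Ai} + \pi_{Bi}}\, y_i
\;=\; Y_A .
```

The Horvitz-Thompson weight for a unit of B differs only through the product term, $(\pi_{Ai} + \pi_{Bi} - \pi_{Ai}\pi_{Bi})^{-1}$, which is why the two coincide approximately when both probabilities are small.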

The approach leading to the estimator (1) is not in
general feasible, since the determination of the weight
$w_i = (\pi_{Ai} + \pi_{Bi})^{-1}$ for $i \in s \cap B$
requires knowledge of $\pi_{Ai}$ for units in $s_B$, and
knowledge of $\pi_{Bi}$ for units in $s_{ab}$. This is
difficult or impossible to ascertain in household surveys
because of stratified multistage sampling. In multiple-panel
household surveys additional complications arise from the
time element. For units that move (e.g., to another stratum)
in the time between the selection of the panels it is
impossible to determine both $\pi_{Ai}$ and $\pi_{Bi}$.

An alternative strategy needs to be considered for devel- 
oping weights for the sample overlap domain s/B. One 
approach that provides a general framework for handling 
this problem requires information on the probability of 
inclusion in only one of SOs, thus avoiding the diffi- 
culty noted above. The essence of the alternative approach 
considered here is to associate with the i-th unit from the 
overlap frame B a number p, (0 < p; < 1) when the unit is 
selected in s,, and the number 1 ~-p, when the unit is 


selected in s ie and then define the weight of the unit as 


wi =p, —Hies,}+(A-p)—Hies\}, icB)"@) 
pj Mai 


where $I$ is the usual sample membership indicator variable.
Clearly, $E(w_i^*) = 1$ under $p(s)$, and thus the use of the
weights $w_i^*$ will yield unbiased estimators $\sum_B w_i^* y_i$ for
the total $Y_B = \sum_B y_i$, for any choice of constants $p_i$
satisfying $0 \le p_i \le 1$, and for any sampling designs $p_B(s_B)$
and $p_A(s_A)$. Equation (2) can be written alternatively as
$w_i^* = p_i w_{Bi} + (1 - p_i) w_{Ai}$, with the obvious definition of the
weights $w_{Bi}$ and $w_{Ai}$ associated with the samples $s_B$ and
$s_A$. Thus, the class of weighting schemes defined by equation
(2) consists essentially of different weighted combinations
of the weights in the original samples $s_B$ and $s_A$.
The limits on the values of $p_i$ ensure that the weight $w_i^*$
will be nonnegative. Note that the intractable weight
$w_i = (\pi_{Ai} + \pi_{Bi})^{-1}$ for $i \in s \cap B$, used in (1), is a special case
of $w_i^*$ with $p_i = \pi_{Bi} (\pi_{Ai} + \pi_{Bi})^{-1}$.
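
As an illustrative check of the weighting scheme (2), the following minimal Python sketch enumerates the four possible selection outcomes for a single unit of the overlap frame, assuming that the two samples are selected independently. The function names and the numerical inclusion probabilities are hypothetical and serve only to illustrate the unbiasedness property and the special case noted above.

```python
from itertools import product


def combined_weight(pi_b, pi_a, p, in_sb, in_sab):
    """Weight (2) for one unit of the overlap frame B.

    pi_b, pi_a: inclusion probabilities in the first and second panel;
    p: number attached to the unit (0 <= p <= 1);
    in_sb, in_sab: 0/1 sample membership indicators.
    """
    return p * in_sb / pi_b + (1.0 - p) * in_sab / pi_a


def expected_weight(pi_b, pi_a, p):
    """Expectation of the weight, assuming independent selection of the panels."""
    total = 0.0
    for in_sb, in_sab in product((0, 1), repeat=2):
        prob_b = pi_b if in_sb else 1.0 - pi_b
        prob_a = pi_a if in_sab else 1.0 - pi_a
        total += prob_b * prob_a * combined_weight(pi_b, pi_a, p, in_sb, in_sab)
    return total


pi_b, pi_a = 0.02, 0.05  # hypothetical inclusion probabilities

# The expectation is one for every admissible value of p.
for p in (0.0, 0.3, 1.0):
    print(p, expected_weight(pi_b, pi_a, p))

# Special case p = pi_b / (pi_a + pi_b): a unit selected in exactly one
# sample receives the weight 1 / (pi_a + pi_b) used in (1).
p_special = pi_b / (pi_a + pi_b)
print(combined_weight(pi_b, pi_a, p_special, 1, 0), 1.0 / (pi_a + pi_b))
```

The enumeration is deliberately explicit; by linearity of expectation the same result holds without the independence assumption.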

Evidently, the weighting scheme defined by (2) does not
eliminate duplicate units that fall in both samples. If the
operational constraint to exclude from $s_A$ individuals
already selected in $s_B$ is imposed, the second term in the
right-hand side of (2) should be modified to
$(1 - p_i)[\pi_{Ai}(1 - \pi_{Bi})]^{-1} I\{i \in s_{ab}, i \notin s_B\}$ to ensure that $E(w_i^*) = 1$.
This, however, may be impossible to do since it requires
that the inclusion probabilities of the sampled units be
known over both frames. Note also that under the constraint
of excluding duplicate units, the two samples will not be
independent. Nevertheless, as it is assumed that both
probabilities $\pi_{Ai}$ and $\pi_{Bi}$ are small, the probability of duplicate
units will be negligibly small, and hence any bias resulting
from using the tractable weighting scheme defined by (2)
would also be negligible. On this assumption, the two
indicator variables in (2) should be understood to satisfy
$I\{i \in s_B\} I\{i \in s_{ab}\} = 0$.

The question arises now as to an optimal choice of $p_i$,
for any $i \in s \cap B$, according to some criterion of optimal
weighting for the combined sample. One approach is to
choose the $p_i$ to minimize the variance of the estimated
total $\sum_B w_i^* y_i + \sum_a w_i y_i$, where $w_i = (\pi_{Ai})^{-1} I\{i \in s_a\}$
for $i \in a$. However, minimization of the variance of this estimator with
respect to $p_i$ for all $i \in s \cap B$ is not tractable. A simpler
option is to restrict the class of weighting schemes defined
by equation (2) to one in which the weight adjustment
factors are specified not at the unit level but rather at a
higher level, which may be a stratum or the entire overlap
frame B. Further discussion on the level of adjustment is
deferred to the last part of this subsection. It suffices for the
development of the weighting procedure to consider next
the case involving a uniform weight adjustment factor p for
the entire frame B.


Determination of the value of p. Issues of practicality and 
efficiency. 

The class of weighting schemes defined by equation (2) for
the frame B, with uniform weight adjustment factor p,
generates a class of unbiased estimators for the overall total $Y_A$
of the form

$$\hat{Y}_A^{p} = p\,\hat{Y}_B + (1 - p)\,\hat{Y}_{ab} + \hat{Y}_a, \qquad (3)$$

where $\hat{Y}_B$ and $\hat{Y}_{ab}$ are the Horvitz-Thompson estimators
of $Y_B$ based on $s_B$ and $s_{ab}$, respectively, and $\hat{Y}_a$ is
the Horvitz-Thompson estimator of $Y_a$ based on $s_a$. The
limit values of p yield two special cases of the estimator
$\hat{Y}_A^{p}$, in both of which the overlap domain total $Y_B$ is
estimated from one panel only. When p is set equal to zero
in (3), the resultant trivial estimator $\hat{Y}_A$ for the entire
population is based only on $s_A$. More notable is the case with p
set equal to one in (3). The implied simple unbiased estimator
$\hat{Y}_A^{1} = \hat{Y}_B + \hat{Y}_a$ would be the natural estimator in a panel
survey with one panel and a supplementary cross-sectional
sample, with the units in that sample being “screened” and
only the units in the domain of new entrants being enumerated.
In such a context this simple estimator would be a
special case of a “screening” multiple frame estimator, the
special feature being the temporal nature of the non-overlap
frame domain a. In the present context the screening estimator
appears inefficient because information in the sample
domain $s_{ab}$ is not utilized. Better use can be made of data
from both panels by combining $s_B$ and $s_{ab}$, using an
optimal p that is based on the minimization of the variance
of $\hat{Y}_A^{p}$. The optimal value of p is given by


$$p_o = \frac{\mathrm{Var}(\hat{Y}_{ab}) + \mathrm{Cov}(\hat{Y}_{ab}, \hat{Y}_a)}{\mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_{ab})}. \qquad (4)$$

The variance and covariance terms in (4) are unknown, but
could be estimated from the sample data, in which case the
chosen p would actually minimize the estimated variance of
$\hat{Y}_A^{p}$. There are many drawbacks associated with this choice
of value for p. Generally, estimation of the optimal p is not
easy; in surveys with more than two panels it would be very
inconvenient to estimate the required set of such weight
adjustments. Also, a sample estimate of the optimal p in (4)
adds variability to the estimator $\hat{Y}_A^{p}$, and complicates the
estimation of its variance. Moreover, the dependency of the
estimated optimal p on the sample data entails $E(w_i^*) \neq 1$
for $i \in B$, which disturbs the unbiasedness of the estimator
(3). It is to be noted that the condition $E(w_i^*) = 1$ is also
necessary for the validity of the weight share method (see
section 4) to hold when applied to the combined sample s at
any wave after the selection of the second panel.

An alternative choice for the value of p is based on the
minimization of the variance of the common-frame component
$\hat{Y}_B^{p} = p\,\hat{Y}_B + (1 - p)\,\hat{Y}_{ab}$ of the estimator $\hat{Y}_A^{p}$ in (3).
This restricted minimization, which ignores the typically
small domain estimator $\hat{Y}_a$, gives the value

$$p' = \frac{\mathrm{Var}(\hat{Y}_{ab})}{\mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_{ab})}, \qquad (5)$$

which is independent of the covariance term, and always
lies between zero and one. Minimizing the variance of $\hat{Y}_B^{p}$
conditional on the realized value of the random size $n_{ab}$ of
the sample domain $s_{ab}$, then using the well-known variance
formula for the estimator of a total under simple random
sampling, and disregarding finite population corrections, it
can be shown that (5) may be approximated by


$$p' = \frac{n_B / d_B}{n_B / d_B + n_{ab} / d_{ab}}, \qquad (6)$$

where $n_B$ is the size of the sample $s_B$, and $d_B$, $d_{ab}$ are the
design effects associated with $s_B$ and $s_{ab}$. The calculation
of the value of $p'$ requires estimates of the two design
effects, which need not be based on $s_B$ and $s_{ab}$. Suitable
approximate values of $d_B$ and $d_{ab}$ may be available from
other surveys with the same sampling designs as the two
panels. However, because of the dependency of $p'$ on the
characteristic y through $d_B$ and $d_{ab}$, a different set of
weights needs to be calculated for each characteristic of
interest. Besides making the estimation process operationally
inconvenient, the different sets of weights may lead
to inconsistencies among estimates. A compromise solution
is to obtain approximate values of $d_B$ and $d_{ab}$, preferably for
a count variable associated with a large population domain.
A similar compromise solution is implicit in the approach
of Skinner and Rao (1996) to estimation in dual frames. It
is to be noted that since $p'$ depends on the characteristic y
only through the ratio $d_B / d_{ab}$, the loss of efficiency for
estimators of totals of other characteristics should not be
substantial. It is to be noted further that because of the time
lag between the selection of the two panels, the design
effects will be different, and thus present in (6), even when
the sampling designs for the two panels are identical. By
using estimates of the design effects from external sources
the randomness of $p'$ is due only to the random size of the
sample domain $s_{ab}$. Since the size of the sample $s_A$ is
usually very large, and the size of the overlap frame B is
typically only a little smaller than the size of the complete
frame A, the size $n_{ab}$ of the sample domain $s_{ab}$ must be
nearly constant, and thus the unbiasedness condition
$E(w_i^*) = 1$ will hold approximately.
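
The calculation in (6) can be sketched in a few lines of Python. This is only an illustration: the function name and the sample sizes and design effects below are hypothetical, standing in for values that would be taken from the panels and from external sources.

```python
def adjustment_factor(n_b, d_b, n_ab, d_ab):
    """Approximate weight adjustment factor of expression (6).

    n_b, n_ab: sizes of the first panel and of the overlap domain of the
    second panel; d_b, d_ab: the corresponding design effects.
    """
    eff_b = n_b / d_b      # effective sample size of the first panel
    eff_ab = n_ab / d_ab   # effective sample size of the overlap domain
    return eff_b / (eff_b + eff_ab)


# Hypothetical sizes and design effects.
p_prime = adjustment_factor(n_b=12000, d_b=1.5, n_ab=15000, d_ab=1.25)
print(p_prime)

# Only the ratio of the design effects matters: scaling both by the same
# constant leaves the factor unchanged.
print(adjustment_factor(n_b=12000, d_b=3.0, n_ab=15000, d_ab=2.5))
```

Because the factor depends on the design effects only through their ratio, approximate values borrowed from a comparable survey may be adequate, in line with the compromise solution described above.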

Some loss of efficiency will be incurred by ignoring $\hat{Y}_a$
in deriving an optimal value for p, but this loss may be
insignificant given the relatively very small size of the
domain a in most household panel surveys, because of the
typically small time lag between panels. To assess this loss
of efficiency, let $\hat{Y}_A^{p_o}$ and $\hat{Y}_A^{p'}$ denote the estimator $\hat{Y}_A^{p}$ in (3)
with the value of p given by (4) and (5), respectively. Then,
a simple calculation gives

$$\mathrm{Var}(\hat{Y}_A^{p'}) - \mathrm{Var}(\hat{Y}_A^{p_o}) = \frac{\mathrm{Cov}^2(\hat{Y}_{ab}, \hat{Y}_a)}{\mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_{ab})} \le \frac{\mathrm{Var}(\hat{Y}_{ab})\,\mathrm{Var}(\hat{Y}_a)}{\mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_{ab})} = p'\,\mathrm{Var}(\hat{Y}_a),$$

so that an upper bound for the efficiency loss can be
obtained as

$$\frac{\mathrm{Var}(\hat{Y}_A^{p'}) - \mathrm{Var}(\hat{Y}_A^{p_o})}{\mathrm{Var}(\hat{Y}_A^{p'})} \le \frac{\mathrm{Var}(\hat{Y}_a)}{\mathrm{Var}(\hat{Y}_A^{p'})}.$$

Given the usually very small size of $Y_a$ relative to $Y_A$ (the
size of the domain a is approximately one fortieth of the
size of the complete frame A in the case of the SLID) it
appears that the loss of efficiency will be very small in most
panel household surveys.

An interesting question is whether or not $\hat{Y}_A^{p'}$ is more
efficient than the simple “screening” estimator
$\hat{Y}_A^{1} = \hat{Y}_B + \hat{Y}_a$, whose variance is $\mathrm{Var}(\hat{Y}_A^{1}) = \mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_a)$. It can
be readily shown that $\mathrm{Var}(\hat{Y}_A^{p'}) < \mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_a)$ if
$2\,\mathrm{Cov}(\hat{Y}_{ab}, \hat{Y}_a) < \mathrm{Var}(\hat{Y}_B)$. This condition certainly holds
if the covariance of $\hat{Y}_{ab}$ and $\hat{Y}_a$ is negative, which may be
the case if the estimated characteristic differs between
immigrants and non-immigrants. In general, this covariance
may actually be positive because $\hat{Y}_{ab}$ and $\hat{Y}_a$ are
based on the same sampled area clusters. In that case too,
however, the condition will most likely hold, given the
magnitude of $\mathrm{Var}(\hat{Y}_B)$ relative to $\mathrm{Var}(\hat{Y}_{ab})$ and the
magnitude of $\mathrm{Var}(\hat{Y}_a)$ relative to $\mathrm{Var}(\hat{Y}_{ab})$. Indeed, the
sizes of the panel samples $s_B$ and $s_A$ are typically equal by
design, although the effective panel sizes (i.e., realized sizes
at any wave, adjusted for design effects) may be considerably
different due to different attrition rates and design
effects for the two panels. Also, with the sizes of the sample
domains $s_{ab}$ and $s_a$ roughly proportional to the corresponding
population domain sizes, $\mathrm{Var}(\hat{Y}_a)$ will be many
times, say k, smaller than $\mathrm{Var}(\hat{Y}_{ab})$. Then

$$2\,\mathrm{Cov}(\hat{Y}_{ab}, \hat{Y}_a) \le 2\sqrt{\mathrm{Var}(\hat{Y}_{ab})\,\mathrm{Var}(\hat{Y}_a)} = \frac{2\,\mathrm{Var}(\hat{Y}_{ab})}{\sqrt{k}},$$

so that a sufficient condition for the estimator $\hat{Y}_A^{p'}$ to be
more efficient than the “screening” estimator is

$$\frac{\mathrm{Var}(\hat{Y}_B)}{\mathrm{Var}(\hat{Y}_{ab})} > \frac{2}{\sqrt{k}}.$$

The interpretation of this is that the sample domain $s_{ab}$ is
not to be ignored when estimating $Y_B$ if $\mathrm{Var}(\hat{Y}_B)$ is not too
small relative to $\mathrm{Var}(\hat{Y}_{ab})$. The condition is ordinarily
satisfied in panel household surveys. An additional argument
in favour of including $s_{ab}$ in estimation is its better
quality relative to $s_B$, since the latter is more liable to the
potential bias effect of sample attrition.
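
The two variance comparisons discussed above can be checked numerically. The following Python sketch evaluates the variance of the estimator (3) as a function of p, assuming only that the first panel is selected independently of the second; the variance and covariance values are hypothetical and chosen merely so that the domain a is small.

```python
def var_combined(p, v_b, v_ab, v_a, cov):
    """Variance of p*Y_B + (1 - p)*Y_ab + Y_a for estimators with variances
    v_b, v_ab, v_a, where cov = Cov(Y_ab, Y_a) and Y_B is independent of
    the other two components."""
    return p ** 2 * v_b + (1.0 - p) ** 2 * v_ab + v_a + 2.0 * (1.0 - p) * cov


# Hypothetical variances and covariance (cov**2 <= v_ab * v_a holds).
v_b, v_ab, v_a, cov = 4.0, 3.0, 0.1, 0.2

p_opt = (v_ab + cov) / (v_b + v_ab)  # expression (4)
p_prime = v_ab / (v_b + v_ab)        # expression (5)

var_opt = var_combined(p_opt, v_b, v_ab, v_a, cov)
var_prime = var_combined(p_prime, v_b, v_ab, v_a, cov)
var_screen = var_combined(1.0, v_b, v_ab, v_a, cov)  # screening estimator

# Efficiency loss from using (5) instead of (4), and its upper bound.
loss = var_prime - var_opt
print(loss, cov ** 2 / (v_b + v_ab), p_prime * v_a)

# Comparison with the screening estimator and the sufficient condition.
print(var_prime < var_screen, 2.0 * cov < v_b)
```

With these illustrative values the loss equals the squared covariance divided by the sum of the two variances, stays below the bound, and the combined estimator has a smaller variance than the screening estimator.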

The simple approximate weight adjustment factor $p'$
given by expression (6) affords an efficient combination of
panel samples, accounting for the precision of $\hat{Y}_B$ relative
to that of $\hat{Y}_{ab}$ through the effective sample sizes $n_B / d_B$ and
$n_{ab} / d_{ab}$. These effective sample sizes are time-dependent,
but their ratio (and hence $p'$) should be quite stable
over the period of panel overlap. Regarding variance
calculations, since $n_{ab}$ is typically nearly non-random, the
adjustment factor $p'$ can be conveniently treated as
constant in any variance estimation procedure.

It is important to emphasize here that additional gains in 
efficiency will result from the incorporation of auxiliary 
information into the weights through a calibration weight 
adjustment to known population totals. 

Finally, it should be remarked that if the criterion in the
choice of the value of p is the minimization of the mean
square error of the common-frame component
$\hat{Y}_B^{p} = p\,\hat{Y}_B + (1 - p)\,\hat{Y}_{ab}$ of the estimator $\hat{Y}_A^{p}$, then it can be easily
shown that when the biases of $\hat{Y}_B$ and $\hat{Y}_{ab}$ are equal the
optimal value of p is the same as the one given by (5). The
biases are not expected to be equal, though; for instance, the
different sample attrition rates for the two panels may result
in different levels of bias. It is clear that the bias of the
linear combination $\hat{Y}_B^{p} = p\,\hat{Y}_B + (1 - p)\,\hat{Y}_{ab}$, though not
minimized if p is as in (5), is always smaller than the
larger of the two component biases. Other complexities
aside, the unavailability of good estimates for the two biases
renders the criterion of minimum mean square error
impracticable.


Generalization to multiple panels and discussion of 
alternative approaches. 


The weighting procedure described above applies to the
simple situation of a two-panel survey at the start of the
second panel. At later survey waves an additional non-overlap
frame domain, denoted by b, may be formed by
returning leavers of the frame B. Units from b originally
selected in the first panel were not present when the second
panel was selected. Clearly, the weights in the non-overlap
sample domain $s_b$ are not to be adjusted for the purpose of
combining the two panels. Furthermore, the value for p will
not be affected, as it is based only on the overlap domain of
the combined sample. As with ignoring the sample domain $s_a$
in determining the value of p, ignoring the much smaller,
possibly void, sample domain $s_b$ will have negligible
impact on the efficiency of derived estimators.

The simplicity of the proposed weighting procedure for 
the combination of two panels makes its generalization to 
surveys with more than two overlapping panels straightfor- 
ward. The most likely generalization in practice would 
involve three panels. The construction of a combined cross- 
sectional sample would then involve the adjustment of the 
sampling weights of units from temporal domains of the 
different panels that represent a common temporal domain 
of the cross-sectional population. For each common tempo- 
ral population domain the weight adjustment factors will be 
based on the relative effective sample sizes of the corre- 
sponding panel domains, in analogy with expression (6), 
and will add up to one. The number of common temporal 
frame domains, and hence the number of the corresponding 
independent sets of adjustment factors, will be quite small 
because of the high degree of nesting in the sequence of 
panel frames. For instance, for a three-panel survey there 
will be one set of three adjustment factors and one set of 
two. 
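
A minimal Python sketch of this generalization follows, assuming that the factors for a common temporal domain are taken proportional to the effective sample sizes of the contributing panel domains, by analogy with expression (6). The function name and all numbers are hypothetical.

```python
def panel_adjustment_factors(sizes, design_effects):
    """Adjustment factors for the panel domains that represent one common
    temporal population domain: proportional to effective sample sizes,
    and adding up to one."""
    effective = [n / d for n, d in zip(sizes, design_effects)]
    total = sum(effective)
    return [e / total for e in effective]


# Three-panel survey, domain common to all three panels (hypothetical values).
factors_three = panel_adjustment_factors([9000, 12000, 15000], [1.5, 1.5, 1.5])

# Domain common to the two most recent panels only.
factors_two = panel_adjustment_factors([12000, 15000], [1.5, 1.25])

print(factors_three, sum(factors_three))
print(factors_two, sum(factors_two))
```

With two panels the first factor reduces to the value given by (6), and the second is its complement.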

Returning now to an earlier point, varied weight adjust- 
ment factors may be specified at a lower level of sample 
grouping, such as a certain stratification level. For reasons 
of feasibility (identical stratification for the two panels is 
required for that level) and operational convenience, a high 
level of stratification should be chosen. The natural choice 
is a superstratum level, at which all other weighting and 
estimation procedures are carried out independently for 
each superstratum. In the SLID, such superstrata are the 
Canadian provinces. The advantage of specifying weight 
adjustment factors at the superstratum level is improved 
efficiency, since an optimal or nearly optimal weight adjust- 
ment factor p can be determined for each superstratum. 
This will be particularly advantageous if the ratios of the 
effective sample sizes of the panels are very different 
among the superstrata, as is the case in the SLID. 

Alternative estimation techniques from the general 
theory of multiple frame surveys with complex designs (for 
an account, see Skinner and Rao 1996, and Singh and Wu 
1996) would produce estimators similar in form to the
estimator (3) if adapted to a multiple panel survey with
overlapping panels. Such techniques, though, are not
preferable in general for reasons similar to those stated in
the discussion following equation (4); the “pseudo-likelihood”
method of Skinner and Rao (1996) is also not
applicable in surveys with more than two panels. Further- 
more, while the weight adjustment proposed in this section 
essentially combines the panels, on the basis of an efficient 
combination of Horvitz-Thompson estimators, the standard 
multiple frame methods ordinarily combine ratio-adjusted 
or, more generally, calibrated estimators derived separately 
using the sample from each frame. In the context of a 
household panel survey, the components from each panel 
would be calibrated estimators incorporating all the weight 
adjustments, including the “weight share” adjustment, 
carried out separately for each panel. This would be in 
conflict with the application of the “weight share” adjust- 
ment to the combined sample, to be proposed in section 4. 
It is interesting to note that apart from this complication 
there are many possible limitations that could render a 
separate calibration of each panel problematic or unfeasible. 
It may be remarked first that a proper separate calibration of 
the panels is possible only when the various temporal 
sample domains are identifiable. Furthermore, a calibration 
involving the same auxiliary variables for each temporal 
domain of each panel would be required in order for the 
final weights to satisfy all calibration constraints. But since 
all temporal frame domains (except the one that is common 
to all panels) are typically very small, a calibration in- 
volving a large number of auxiliary totals (as is customary 
in household surveys) would not be sensible for reasons of 
potential bias and loss of efficiency of derived estimators. 
Moreover, auxiliary totals for frames of old panels that 
account for the loss of population units may not be avail- 
able. It should also be pointed out that accurate auxiliary 
totals most likely would be unavailable if the frame of each 
panel were augmented with new entrants who live with 
individuals of the original frame of the panel. Such would 
be the situation if the “weight share” procedure, which 
assigns a basic weight to new entrants living with selected 
individuals, were to precede the combination of the panels. 

Notwithstanding other difficulties, it is possible in prin- 
ciple to use standard multiple frame methods to combine 
the panels, avoiding a separate calibrating weight adjust- 
ment, with the exception of the dual-frame pseudo- 
likelihood method of Skinner and Rao which in the setting 
of Figure 1 would require a simple ratio weight adjustment
for Sos. ands. 

Lastly, a known drawback of various multiple frame 
estimators is that their optimality depends on the estimated 
characteristic of interest. For the proposed method this 
dependency appears to be weaker, because the optimal p’ 
in (6) depends on the particular characteristic only through 
a ratio of panel design effects, estimated from an extraneous 
source. 




3.2 Non-identifiable Temporal Sample Domains 


It has been assumed thus far that the units of the non-overlap
sample domain $s_a$ ($\subset s_A$) can be identified. However,
the information needed to determine whether a unit in $s_A$
belongs to the frame domain a, of new entrants into the
population after the start of the previous panel, may not be
available for all units of $s_A$. In that situation the weighting
process described above would combine the two samples
$s_B$ and $s_A$ without distinguishing between the domains $s_{ab}$
and $s_a$ of $s_A$, so that the weights of units in $s_a$ would also
be multiplied by $1 - p$. The estimator $\hat{Y}_A^{p}$ in (3) would
collapse then to

$$\tilde{Y}_A^{p} = p\,\hat{Y}_B + (1 - p)(\hat{Y}_{ab} + \hat{Y}_a) = p\,\hat{Y}_B + (1 - p)\,\hat{Y}_A. \qquad (7)$$


The effect of this error is the underestimation of the total $Y_a$
for the population domain a by the factor p. Part of the
domain a, though, consists of newborns, which can be
identified in $s_A$ with certainty. Their weights could very
well be excluded from the adjustment by the factor $1 - p$,
but that would have no effect on cross-sectional estimation,
unless newborns were part of the population of interest.
Besides, adjusting the weights of newborns in $s_A$ by the
factor $1 - p$ has the desirable effect of producing a common
household weight. A calibration of the weights of the 
combined sample to known population totals of the 
complete frame A will lessen the under-representation of 
the rest of the domain a, which consists mainly of 
immigrants, but some bias may still result if the survey 
characteristics of the members of this part of the population 
are quite different from those of the members of the popu- 
lation domain B. Unless the time lag between the selection 
of the two panels is quite large, the size of this part of the 
population is very small, relative to the total population, and 
the potential bias effect on overall estimates of totals should 
be negligible. 

The optimal (i.e., variance minimizing) value of p in (7)
is given now by

$$p'' = \frac{\mathrm{Var}(\hat{Y}_A)}{\mathrm{Var}(\hat{Y}_B) + \mathrm{Var}(\hat{Y}_A)}. \qquad (8)$$

Disregarding finite population corrections it can be shown
that (8) can be expressed as

$$p'' = \frac{n_B / d_B}{c\,n_A / d_A + n_B / d_B}, \qquad (9)$$

with $c = (N_B S_B)^2 (N_A S_A)^{-2}$, and where $n_B$, $n_A$ are the sizes
of the samples $s_B$ and $s_A$; $d_B$, $d_A$ are the design effects
associated with $s_B$ and $s_A$ for the characteristic y; $N_A$, $N_B$
are the sizes of the frames A and B; $S_A^2$, $S_B^2$ are the
variances of the characteristic y in A and B. Noting that $N_B$
may be only a little smaller than $N_A$ (depending on the time
lag between the two panels), and assuming that the
unknown variances $S_A^2$ and $S_B^2$ are nearly equal, a good
practical approximation of the optimal $p''$ can be obtained by
simply setting c equal to one in (9). The assumption that the
variances $S_A^2$ and $S_B^2$ are nearly equal is reasonable
considering the magnitude of $N_B$ relative to that of $N_A$.
Approximate values of $d_B$ and $d_A$ available from other surveys
with the same designs as the two panels could be used,
preferably for a characteristic such as the size of a large
population domain. Now, if $\tilde{Y}_c$ and $\tilde{Y}_1$ denote the estimator
in (7) when the weight adjustment $p''$ in (9) is used
with the true value of c and the approximate value c = 1,
respectively, then ignoring finite population corrections the
loss of efficiency of $\tilde{Y}_1$ relative to $\tilde{Y}_c$ can be readily shown
to be

$$\frac{\mathrm{Var}(\tilde{Y}_1) - \mathrm{Var}(\tilde{Y}_c)}{\mathrm{Var}(\tilde{Y}_c)} = p_1''(1 - p_1'')\,\frac{(1 - c)^2}{c},$$

where $p_1''$ is the value of (9) at c = 1.
With a value of c most likely in the neighbourhood of 1.0,
the loss of efficiency will be negligible.
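
The effect of setting c equal to one can be illustrated with a short Python sketch. It assumes, as in the text, simple random sampling variance formulas inflated by design effects and no finite population corrections, so that the two variances are proportional to $c\,d_B/n_B$ and $d_A/n_A$. The function names, the sample sizes, the design effects and the value of c are hypothetical, and the closed form used for comparison is the one that follows from these assumptions.

```python
def p_double_prime(n_b, d_b, n_a, d_a, c=1.0):
    """Adjustment factor for the two full panels; c = 1 is the practical
    approximation, other values of c use the ratio of (N*S)^2 terms."""
    eff_b = n_b / d_b
    eff_a = n_a / d_a
    return eff_b / (c * eff_a + eff_b)


def variance(p, n_b, d_b, n_a, d_a, c):
    """Variance of p*Y_B + (1 - p)*Y_A in units of (N_A * S_A)**2,
    ignoring finite population corrections; the panels are independent."""
    v_b = c * d_b / n_b
    v_a = d_a / n_a
    return p ** 2 * v_b + (1.0 - p) ** 2 * v_a


design = dict(n_b=12000, d_b=1.5, n_a=15000, d_a=1.25)  # hypothetical
c_true = 0.95                                           # hypothetical

p_c = p_double_prime(c=c_true, **design)  # factor using the true c
p_1 = p_double_prime(c=1.0, **design)     # factor using c = 1

var_c = variance(p_c, c=c_true, **design)
var_1 = variance(p_1, c=c_true, **design)

relative_loss = (var_1 - var_c) / var_c
closed_form = p_1 * (1.0 - p_1) * (1.0 - c_true) ** 2 / c_true
print(relative_loss, closed_form)
```

For a value of c this close to one the relative loss is well below one percent, which is consistent with the remark that the approximation costs little.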

It is interesting to examine the efficiency of the estimator
given by (7), with $p''$ as in (8), relative to the optimal
estimator given by (3), with p as in (4), used when the domain $s_a$
is identifiable. Let $\tilde{Y}_A^{p''}$ and $\hat{Y}_A^{p_o}$ denote these estimators,
respectively. Then, using the inequality
$\mathrm{Cov}^2(\hat{Y}_{ab}, \hat{Y}_a) \le \mathrm{Var}(\hat{Y}_{ab})\,\mathrm{Var}(\hat{Y}_a)$, it can be shown that
$\mathrm{Var}(\hat{Y}_A^{p_o}) - \mathrm{Var}(\tilde{Y}_A^{p''}) \ge (p'' - p')\,\mathrm{Var}(\hat{Y}_A)$, where $p'$ is as
in (5). As already mentioned, in general $\mathrm{corr}(\hat{Y}_{ab}, \hat{Y}_a) \ge 0$,
so that $p'' > p'$ and hence $\mathrm{Var}(\hat{Y}_A^{p_o}) \ge \mathrm{Var}(\tilde{Y}_A^{p''})$. Therefore,
notwithstanding the use of the exact values of $p''$ and $p'$
in the comparison, the approach taken in this subsection
may in most cases result in reduction of the variance of
derived estimators. A lower bound for the gain in efficiency
relative to $\hat{Y}_A^{p_o}$ would then be given by

$$\frac{\mathrm{Var}(\hat{Y}_A^{p_o}) - \mathrm{Var}(\tilde{Y}_A^{p''})}{\mathrm{Var}(\tilde{Y}_A^{p''})} \ge \frac{p'' - p'}{1 - p''}.$$

An extension of the weight adjustment procedure 
described above to surveys involving more than two panels 
with non-identifiable temporal sample domains is straight- 
forward. There will be then as many weight adjustment 
factors, adding up to one, as there are panels. This very 
practical procedure will produce good cross-sectional esti- 
mates in multiple panel surveys in which the time lag 
between the selection of the panels is not large. Otherwise, 
the potential for bias due to the domain identification error 
may be of concern, mainly for estimates related to
subpopulations composed in substantial proportion of new
entrants.


4. THE WEIGHT SHARE METHOD FOR THE 
COMBINED PANELS 


This section describes the application of a weight adjust- 
ment method, known as the weight share method, to the 
combined panel sample at any wave after the start of the 
most recent panel. This weight adjustment is necessary 
because of the changes in the household membership after 
the selection of the panels. 

The weight share method is a cross-sectional weighting 
procedure that assigns a basic weight to every individual in 
a panel household at any wave after the first. In particular, 
the weight share method, as applied to a single panel, 
assigns a positive weight to non-selected individuals who 
join households containing at least one individual selected 
for the original sample. Following Lavallée (1995), in this 
paper such households are termed longitudinal households, 
while the non-selected individuals living in longitudinal 
households are termed cohabitants. The cohabitants are 
distinguished into originally present cohabitants if they 
belong to the original (sampled) population, and originally 
absent cohabitants if they are new entrants to the pop- 
ulation. Other problematic situations that can be handled by 
the weight share method involve non-selected households 
formed after the first wave by members of separate 
originally selected households, as well as originally selected 
individuals who have subsequently moved to other longi- 
tudinal households. For a detailed discussion of the weight 
share method for a single panel, see Kalton and Brick 
(1995), and Lavallée (1995). For the purpose of applying 
the weight share method to a multiple panel survey the 
following need to be considered. In multiple panel surveys, 
the original population for the combined panels is the union 
of the populations covered by the different panels at the 
time of their selection. Accordingly, the original sample 
consists of all selected units in the combined panel sample. 
Thus, an originally present cohabitant is an individual that 
was eligible for selection in any of the panels. In this 
approach then, at any wave after the selection of the most 
recent panel a cohabitant is distinguished into originally 
present or originally absent with respect to the original 
combined panel sample, not with respect to each original 
panel. Notably, at the first wave of a new panel, or when a 
top-up sample is used, all cohabitants are originally present. 
On the other hand, application of the weight share method 
separately to each panel (before combination) would 
require more precise information on the eligibility of the 
cohabitants for selection in each of the various panels, in 
order to distinguish the originally present cohabitants from 
the originally absent cohabitants and to identify the 
temporal domain that includes each of the cohabitants. 
Such information most likely would be unavailable. 
Moreover, combining the panels after the weight share
procedure would require a very complicated set of
specifications in order to ensure that a suitable weight adjustment
factor would be applied to each sampled unit. For instance, 
with the inclusion of the originally absent cohabitants into 
the panels through the weight share procedure, the frames 
of the panels will be different at each survey wave, thereby 
complicating the determination of the various temporal 
domains. Lastly, it should also be pointed out that in 
multiple panel surveys sampled individuals may move from 
one panel to another panel between waves during the time 
period of panel overlap, and non-sampled households may 
be formed by members of originally selected households 
from different panels. Thus, the panels are truly distinct 
(and independent) only with respect to the time of their 
selection. 

It follows from the foregoing considerations that the 
weight share method for multiple panels is to be applied to 
the combined panel sample, and not to each panel sepa- 
rately. Then, with the prescribed distinction of the two types 
of cohabitants, the case of the weight share method for a 
multiple panel survey reduces to the case of a single panel 
survey. As a desirable consequence, the application of the 
weight share method to the combined sample will yield 
always a common weight for all members of the same 
household. The following is an exemplification of the 
proposed weight share procedure for multiple panel 
surveys, involving the simple case of two panels. 

Starting with a survey setting as depicted in Figure 1,
with two overlapping panels at the time point of the start of
the second panel, let there be N individuals in the population
at a later wave (time t), with $N_i$ individuals in
household $\mathcal{H}_i$, say; $i = 1, \ldots, H$ and $\sum_i N_i = N$. Let $M_i$
denote the number of individuals in household $\mathcal{H}_i$ at time
t that belong to the original population, with $M_{Bi}$ and $M_{ai}$
individuals from the original frame domain B and the non-overlap
frame domain a, respectively, so that
$M_i = M_{Bi} + M_{ai}$. Some, but not all, of the numbers $M_{Bi}$, $M_{ai}$, and
$N_i - M_i$ may be zero for any particular household. Now,
with the random weights of individuals in B and a as
defined in section 3.1, and with the weights of the $N_i - M_i$
originally absent cohabitants in $\mathcal{H}_i$ being identically equal
to zero, the weight share method defines a common weight
for every individual in $\mathcal{H}_i$ (including new members) as

$$w_i = \frac{1}{M_i} \sum_{k=1}^{M_i} w_{ik}, \qquad (10)$$
where $w_{ik}$ is the weight of the k-th household member that
belongs to the original population. Clearly then $E(w_i) = 1$
for each household for which $M_i \neq 0$, whereas $E(w_i) = 0$
if $M_i = 0$, since $w_i \neq 0$ only if $M_i > 0$. For the survey
characteristic y, the total for the population of individuals at
time t can be expressed as $Y = \sum_{i=1}^{H} \sum_{k=1}^{N_i} y_{ik}$, where $y_{ik}$ is
the value of y for individual k in household $\mathcal{H}_i$. Then, an
estimator of Y is given by

$$\hat{Y} = \sum_{i=1}^{H} w_i \sum_{k=1}^{N_i} y_{ik} = \sum_{i=1}^{H} w_i \left( \sum_{k=1}^{M_{Bi}} y_{ik} + \sum_{k=1}^{M_{ai}} y_{ik} + \sum_{k=1}^{N_i - M_i} y_{ik} \right) = \hat{Y}_B + \hat{Y}_a + \hat{Y}_{A^c}, \qquad (11)$$


with $w_i$ as in (10), with $A^c$ denoting the set of individuals
not in frame A, and with the obvious notation for the right
hand side of (11). The estimator $\hat{Y}$ in (11) is given as the
sum of three estimators, $\hat{Y}_B$, $\hat{Y}_a$ and $\hat{Y}_{A^c}$, for the totals
related to the population domains B, a and $A^c$, respectively.
The estimators $\hat{Y}_B$ and $\hat{Y}_a$ are unbiased, even though they
are based on sets of units that may not be identical to the
original samples $s_B \cup s_{ab}$ and $s_a$, respectively. For example,
the estimator $\hat{Y}_B$ is based on a set of units consisting of the
remaining units of the original combined sample $s_B \cup s_{ab}$
from frame B, and possibly of cohabitants originally present
in B. The estimator $\hat{Y}_{A^c}$ is not unbiased for $Y_{A^c}$, because
individuals in $A^c$ who live in households that contain no
members of the original population are not represented in
the panel survey. Nevertheless, the estimator $\hat{Y}_{A^c}$ is
unbiased for the total corresponding to the rest of $A^c$,
which is represented in the combined panels by the originally
absent cohabitants. In the special case when time t
coincides with the start of the second panel (or with the
time of selection of a supplementary sample), $A^c = \emptyset$,
$N_i = M_i$, and the estimator $\hat{Y} = \hat{Y}_B + \hat{Y}_a$ is unbiased for Y. It
should be noted here that if the weights of the responding
individuals at time t are adjusted for nonresponse, the
relationship $E(w_i) = 1$ may hold only approximately, and
in that sense the resulting estimators may be only
approximately unbiased.

It is important to note that the estimator $\hat{Y}$ in (11) can be
expressed as

$$\hat{Y} = \sum_{i=1}^{H} w_i Y_i,$$

where $Y_i = \sum_{k=1}^{N_i} y_{ik}$ is the total for household $\mathcal{H}_i$. Thus, $\hat{Y}$
is also an estimator of the household-level total at time t.
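
The weight share calculation in (10) and the estimator (11) can be sketched in Python as follows. The data layout (a list of households, each a list of member records) and all values are hypothetical; the weights of non-selected individuals are taken to be zero, as stated above.

```python
def household_weight(members):
    """Common weight (10) shared by all members of one household.

    Each member is a dict with keys 'weight' (combined-panel weight, zero
    for non-selected individuals), 'original' (True if the individual
    belongs to the original population) and 'y' (survey characteristic).
    """
    original = [m for m in members if m["original"]]
    if not original:
        return 0.0  # no member of the original population: weight is zero
    return sum(m["weight"] for m in original) / len(original)


def estimate_total(households):
    """Estimator (11): shared household weight times household total."""
    return sum(
        household_weight(members) * sum(m["y"] for m in members)
        for members in households
    )


households = [
    [   # one selected individual, one originally present cohabitant,
        # and one originally absent cohabitant
        {"weight": 100.0, "original": True, "y": 10.0},
        {"weight": 0.0, "original": True, "y": 20.0},
        {"weight": 0.0, "original": False, "y": 5.0},
    ],
    [   # two selected individuals, possibly from different panels
        {"weight": 80.0, "original": True, "y": 1.0},
        {"weight": 120.0, "original": True, "y": 2.0},
    ],
]

print([household_weight(h) for h in households])
print(estimate_total(households))
```

In the first household the originally absent cohabitant does not enter the averaging in (10) but still receives the shared weight, which is how new entrants living with sampled individuals come to be represented.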

As with the weight adjustment involved in the combination
of panels, the weight share adjustment may also be
carried out at a superstratum level, say province, for the
combined sample of each province. In this approach, those
individuals who at time t reside in a province other than the
one in which they resided at the time of selection of any of
the panels are treated as originally absent, since they were
not members of the original population of their new
province. In particular, interprovincial movers (selected or
non-selected in their original province) who are found in
longitudinal households in their new province at time t are
treated as originally absent cohabitants. When a top-up
sample is used at time t, these interprovincial movers are
treated as originally present cohabitants. The application of
the weight share procedure separately for each superstratum
enjoys certain operational and statistical advantages over
the standard weight share procedure. An account of the
comparative merits of the two approaches is given in
Merkouris (1999).


5. INTEGRATION OF VARIOUS WEIGHT 
ADJUSTMENTS 


In addition to the weight adjustments described so far, 
other adjustments to the weights of a panel household 
survey may also be required. The integration of the various 
weight adjustments is briefly outlined below. 

The first adjustment, applied in relation to the original 
sample units, is for wave nonresponse, which arises when 
a sampled unit responds for some but not all of the waves 
for which it was eligible. For a discussion on weight adjust- 
ment for wave nonresponse, see Kalton and Brick (1995). 
The adjustment is made separately to the different panels at 
each wave. 

The second adjustment is for the combination of the 
samples of the various panels into one sample for cross- 
sectional estimation. It applies to the weights of the sampled 
units of the panels, adjusted for wave nonresponse, and 
employs the method described in section 3. 

The third adjustment involves the application of the 
weight share procedure to the combined panel sample at 
any wave after the start of the most recent panel, as 
described in section 4. 

Finally, in the weight calibration adjustment the weights 
of the combined panel units are adjusted so as to make the 
estimated totals for certain auxiliary characteristics equal to 
known population totals for these characteristics at the 
current wave, which in the simple case as in Figure 1 
correspond to totals of the complete frame A. In more 
general situations, after the selection of the most recent 
panel the calibration totals will include the new entrants 
into the population. Note that in the absence of a top-up 
sample the new entrants will be represented in the panels 
only by the originally absent cohabitants. Calibrating the 
weights of the combined sample to population totals of each 
of the different temporal domains (when the panel units 
from these domains can be identified) may not be feasible 
or sensible for reasons already noted in section 3.1. 


6. SUMMARY AND CONCLUDING REMARKS 


The weighting procedures described in this paper can be 
used to combine information from multiple panels of a 
repeated household survey for cross-sectional estimation in 
a fairly general setting involving panels with given designs; 
design issues regarding determination of optimal sampling 
fractions for the panels, in conjunction with efficient
combination of the panel data, are beyond the scope of this
paper. It has been shown that although a multiple panel 
survey can be viewed as a special type of multiple frame 
survey, its distinctive dynamic character renders conven- 
tional multiple frame estimation procedures problematic or 
even inapplicable. The proposed weighting procedures,
which account for the population and panel dynamics, 
involve a simple weight adjustment for each panel that is 
proportional to the effective panel size. These procedures 
are operationally convenient for any number of overlapping 
panels, and for different situations regarding the 
identifiability of various temporal panel domains. Theo- 
retical and practical issues related to the application of a 
weight share adjustment, to the calibration weight adjust- 
ment and to the integration of the various weighting 
procedures involved in a multiple panel survey have also 
been addressed. In particular, it has been argued that the 
weight adjustment for the combination of the panels should 
precede the weight share adjustment, with calibration being 
the final weight adjustment. A detailed empirical study of 
issues pertaining to the determination of weight adjustment 
factors for combining two panels of the SLID, based on the 
methodology of this paper, is described in Latouche et al. 
(2000). The variance of cross-sectional estimators has been 
discussed in this paper only in the context of efficient 
combination of panels. Variance estimation issues related 
to changes in the sample over time, particularly to moves 
from stratum to stratum, are discussed in Merkouris (1999). 
It is to be remarked, in conclusion, that the quality of a 
cross-sectional estimation procedure depends on the identi- 
fiability of various overlap temporal sample domains; on 
design features of the survey, such as the duration of (and 
the lag between) the panels and the use of a supplementary 
sample at any survey wave; and on the adequacy of the 
information on cohabitants required for the application of 
the weight share method. 
Producing Small Area Estimates From National Surveys:
Methods for Minimizing use of Indirect Estimators 


DAVID A. MARKER¹


ABSTRACT 


National surveys are usually designed to produce estimates for the country as a whole and for major geographical regions. 
There is, however, a growing demand for small area estimates on the same attributes measured in these surveys. For 
example, many countries in transition are moving away from centralized decision-making, and western countries like the 
United States are devolving programs such as welfare from Federal to state responsibilities. Direct estimates for small areas 
from national surveys are frequently too unstable to be useful, resulting in the desire to find ways to improve estimates for 
small areas. While it is always possible to produce indirect, model-dependent, estimates for small areas, it is desirable to 
produce direct estimators where possible. Through stratification and oversampling, it is possible to increase the number 
of small areas for which accurate direct estimation is possible. When estimates are required for other small areas, it is 
possible to use forms of dual-frame estimation to combine the national survey with supplements in specific areas to produce 
direct estimates. This article reviews the methods that may be used to produce direct estimates for small areas. 


KEY WORDS: Small area estimation; Direct estimation; Stratification; Oversampling; Dual-frame estimation. 


1. INTRODUCTION 


Throughout the world there is an increased demand for 
small area estimates. During the 1990s countries in transi- 
tion moved away from centralized decision-making, re- 
quiring accurate estimates of local economic and demo- 
graphic conditions. In the United States the Federal govern- 
ment has been moving responsibility for many social 
programs to the 50 states. Evaluating the success of such 
efforts requires accurate estimates for each state. Some 
programs such as the Small Area Income and Poverty 
Estimates (Citro and Kalton 2000) are required at much 
smaller levels of geography, for example for thousands of 
school districts. Regardless of the best plans of survey 
designers, “The client will always require more than is 
specified at the design stage” (Fuller 1999, page 344). 

Ideally such estimates would be produced from direct 
(design-based) estimators. Unfortunately, at small levels of 
aggregation, the direct estimates are too unstable to be 
published and/or used for policy purposes. As a result there 
has been a great deal of interest in developing a range of 
indirect estimation techniques (Marker 1999; Rao 1999; 
Ghosh and Rao 1994). 

This paper approaches this problem from a different 
perspective: how to minimize model reliance through good
survey design. It will never be possible to anticipate all 
survey uses, or to allocate sufficient sample sizes to all 
domains of interest, so indirect estimators will always be 
needed. It is possible, however, to make design choices that 
will greatly improve the ability of national surveys to 
support direct estimation for many small areas. Such 
choices can also improve the ability of surveys to be used to 
produce indirect estimates where they are needed. This
paper is an update of the excellent paper by Singh,
Gambino and Mantel (1994) on the same topic. Design 
issues that will be considered include stratification and 
oversampling, combining multiple years of data, harmoni- 
zation across surveys, dual-frame estimation, and measuring 
the accuracy of estimates. 


2. STRATIFICATION AND OVERSAMPLING 


Deciding on the optimal stratification and oversampling 
scheme for any national survey is a compromise across 
many variables of interest. Optimizing stratification and 
oversampling between national estimates and small area 
estimates should also be a compromise. By giving up some 
national accuracy it is often possible to greatly improve the 
accuracy for many small areas. Some of these small areas 
may then be able to support accurate design-based esti- 
mates. Other small areas will still require model assistance, 
but the stratification may allow for unbiased (but variable) 
estimates that can be incorporated into the model-based 
estimates. As the following example demonstrates, strati- 
fication alone is helpful, but limited, in its ability to improve 
small area estimates. 

The United States Current Population Survey (CPS), 
conducted by the U.S. Census Bureau, has stratified by state 
and unemployment rate since 1985. However, another large 
Census Bureau survey, the United States National Health 
Interview Survey (NHIS), stratified by region, metropolitan 
area status, labor force data, income, and racial composition 
until 1994. The resulting sample sizes for individual states 
varied from year to year and did not support unbiased 
state-level estimates. Due to random sampling, from 1985
to 1994 two states did not have any sample included in the
NHIS. This would not have happened with state
stratification.

¹ David A. Marker, Westat, 1650 Research Blvd., Maryland, U.S.A. 20850, e-mail: DavidMarker@Westat.com.

Beginning in 1995 the NHIS stratification scheme was
replaced by stratification by state and metropolitan status.
Table 1 summarizes the number of states that have sufficient
sample size in the 1995 NHIS to achieve various levels of
accuracy for four different key health measures. The NHIS
completes interviews with approximately 44,000 households
containing 100,000 individuals. With a very strict constraint
of a 10 percent coefficient of variation (CV), fewer than 10
states meet the standard for three of the four variables. Over
half of the states meet the more lenient 30 percent CV for all
four variables, but even this standard is not met for all
states.
Figure 1 presents the ability of the NHIS to meet these
accuracy standards for generic questions with prevalence 
levels of 0.01, 0.05, 0.10, 0.15, and 0.20 and design effects 
ranging from 1.00 to 6.00. (This variation in design effects 
is found on the NHIS, depending on the intra-household 
correlation and other clustering.) For prevalence rates above 
10 percent, almost all states can achieve the 30 percent 
criterion even for the largest design effects. However, there 
is a significant drop off in the number of states as the 
criterion is tightened, the design effect increases, or the 
prevalence rate drops. For rare events with even moderate
design effects, fewer than half the states can meet the weakest
criterion and hardly any can meet the tightest.
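
The interaction between prevalence, design effect, and the CV criterion can be illustrated with the standard approximation CV ≈ sqrt(deff (1 - p) / (n p)) for an estimated proportion. The sketch below is only an illustration under that approximation: the function names are hypothetical, finite population corrections are ignored, and it is not the calculation used to produce Table 1 or Figure 1.

```python
import math

def cv_of_proportion(p, n, deff=1.0):
    """Approximate CV of an estimated proportion p based on n respondents."""
    return math.sqrt(deff * (1.0 - p) / (n * p))

def required_n(p, target_cv, deff=1.0):
    """Smallest n whose approximate CV does not exceed target_cv."""
    raw = deff * (1.0 - p) / (p * target_cv ** 2)
    # Round first so floating-point noise does not push an exact answer up by one.
    return math.ceil(round(raw, 6))

# Same grid as the figure: 5 prevalence levels by 4 design effects.
for p in (0.01, 0.05, 0.10, 0.15, 0.20):
    for deff in (1.0, 1.5, 3.0, 6.0):
        sizes = [required_n(p, cv, deff) for cv in (0.30, 0.20, 0.10)]
        print(f"p={p:.2f} deff={deff:.2f} n for 30/20/10% CV: {sizes}")
```

Under this approximation, a characteristic with p = 0.10 and no design effect needs about 100 respondents in a state for a 30 percent CV, whereas p = 0.01 with a design effect of 6 needs roughly 59,400 for a 10 percent CV.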


Table 1
Summary of the Number of States (out of 51, Including the District of Columbia) That Have the Required 1995
NHIS Sample Size to Achieve a CV of 30-, 20-, and 10-Percent for Four Selected Variables
(44,000 Households, 100,000 Individuals)

Coefficient of    Percent uninsured:   Percent uninsured:   Percent uninsured:     Percent smokers:
Variation (CV)    all ages             under 19             low income children    18 and over
                  (p = 13.5%)          (p = 12.2%)          (p = 20.4%)            (p = 25.2%)
30-percent        42                   31                   28                     45
20-percent        31                   13                   10                     36
10-percent         7                    2                    2                     14


[Figure 1. Vertical axis: Number of states. Horizontal axis: 4 Design Effects (1.00, 1.50, 3.00, 6.00) for Each of 5 Prevalence Levels (p = 0.01, 0.05, 0.10, 0.15, 0.20). Legend: 30% CV, 20% CV, 10% CV.]

Figure 1. Number of States Meeting CV Criteria for 1995 NHIS (44,000 Households, 100,000 Individuals)
Stratification by small area assures that a fixed sample size
will be assigned to each small area, and thereby fixes the 
accuracy associated with direct estimates. Without such 
stratification, it may even be impossible to produce unbi- 
ased estimates for small areas that do contain some sample, 
because the probabilities of selection for sampled cases are 
a function of their entire stratum, both inside and outside 
the small area. For example, this can occur when part of a 
small area is in a stratum that crosses small area boundaries, 
and the sampled PSUs are in other small areas. Producing
direct estimates then requires collapsing either stratum
boundaries or small area boundaries.

By oversampling small areas it is possible to signifi- 
cantly improve the accuracy of direct estimates for these 
areas, while only incurring a minimal loss in accuracy for 
national estimates. As a simple example, consider a national 
survey with 5,000 respondents but where under a random 
sampling scheme 10 of the small areas would only receive 
100 cases each. Alternatively one could double the sample 
size to 200 in each of these small areas while retaining the 
national sample size of 5,000. The effective sample size for 
national estimates would be reduced by this oversampling, 
but would remain more than 4,000, so the CV of national 
estimates would increase less than 10 percent. The CV for 
estimates in each of the 10 small areas would decrease 30 
percent because the sample size was doubled. 
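
The figures in this example can be checked with Kish's approximation for the effective sample size under unequal weighting, n_eff = (Σw)² / Σw². The sketch below assumes that the 1,000 extra small-area cases are taken proportionally from the remaining 4,000 and that unit variances are equal across areas; the function name is hypothetical.

```python
def kish_effective_n(groups):
    """groups: list of (sample_size, relative_weight) pairs."""
    sum_w = sum(n * w for n, w in groups)
    sum_w2 = sum(n * w * w for n, w in groups)
    return sum_w ** 2 / sum_w2

# Proportional design: 10 small areas x 100 cases, plus 4,000 cases elsewhere.
# Oversampled design:  10 small areas x 200 cases, plus 3,000 cases elsewhere.
small_w = 100 / 200    # each small-area case now represents half as many people
rest_w = 4000 / 3000   # each remaining case represents one third more people
n_eff = kish_effective_n([(2000, small_w), (3000, rest_w)])

national_cv_increase = (5000 / n_eff) ** 0.5 - 1    # relative to proportional design
small_area_cv_decrease = 1 - (100 / 200) ** 0.5      # sample size doubled

print(round(n_eff), round(100 * national_cv_increase, 1),
      round(100 * small_area_cv_decrease, 1))
```

Under these assumptions the effective national sample size is about 4,286, the national CV rises by about 8 percent, and the small-area CV falls by about 29 percent, in line with the rounded figures quoted above.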

Beginning in 1999 the U.S. National Household Survey 
on Drug Abuse has combined stratification and over- 
sampling to produce direct estimates for every state 
(Chromy, Bowman and Penne 1999). 

Singh et al. (1994) provided an example of oversampling 
small areas in the Canadian Labour Force Survey. Seventy 
percent of the sample was allocated to provide optimal 
national and provincial estimates. The remaining 30 percent 
were used to supplement small areas to improve their esti- 
mates. National CVs were increased between 10 and 20 
percent by this compromise design, but unemployment 
insurance regions’ estimates had CV reductions as large as 
50 percent. 

A similar design was used for the 2000 Danish Health 
and Morbidity Survey. The survey included two national 
samples, each of 6,000 respondents. An additional 8,000 
respondents were distributed to assure that at least 1,000 
respondents would be in each county. 

The effect of oversampling on CVs can also be seen by 
comparing the 1996 CPS and 1995 NHIS with America’s 
1996 Survey of Income and Program Participation (SIPP). 
The CPS not only stratified by state, it also oversampled 
smaller states. The NHIS stratified by state but didn’t 
oversample based on geography (minority groups were 
oversampled, but they tend to be located in the more 
populous states). In contrast, the SIPP did not stratify by 
state nor did it oversample. The ratios of the largest to
smallest state sample sizes were 11:1 for CPS, 60:1 for SIPP,
and 110:1 for NHIS. The corresponding ratios of CVs were
3.5:1, 7.5:1, and 10.5:1. Oversampling resulted in the CVs
for the smallest states being reduced by almost
two-thirds!

It is important to remember that oversampling based on 
geography doesn’t necessarily reduce the variability in 
other domains of interest, for example demographic sub- 
groups. The ratios of largest to smallest state sample sizes 
in the CPS were 15:1 for children, 20:1 for the elderly, 
500:1 for Blacks, and 800:1 for Hispanics. 

The 1994 U.S. National Employer Health Insurance 
Survey (NEHIS) oversampled smaller states to balance the 
need for accurate state and national estimates. The overall 
sample of 40,000 establishments had to be spread across all 
51 states to provide direct estimates for all states. Three 
options were considered: 


Option A: Optimal national allocation (based on total
          employment in the state) yielded very small
          sizes in some states.

Option B: Equal allocation to all states yielded inefficient
          national estimates.

Option C: Minimum 400 completes per state (allocate
          based on number of employees to the 0.3
          power).


The corresponding ratios of largest to smallest state CVs
were 7.2:1 for Option A, 1:1 for Option B, and 1.8:1 for 
Option C. Compared to Option C, the national CV with 
Option A was 17 percent lower, but with Option B was 22 
percent higher. Option C was selected over Option A since 
it reduced the variation in state CVs by a factor of 4 while 
only moderately increasing the national CV. 
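
A compromise allocation of the kind described in Option C can be sketched as follows. This is an illustration only: the iterative treatment of the minimum, the function name, and the employment figures are all assumptions, not the procedure or data actually used for the NEHIS.

```python
def power_allocation(sizes, total_n, power=0.3, minimum=400):
    """Allocate total_n in proportion to sizes**power, with a floor per state."""
    if total_n < minimum * len(sizes):
        raise ValueError("total_n is too small to give every state the minimum")
    alloc = {state: float(minimum) for state in sizes}  # everyone starts at the floor
    free = set(sizes)                                    # states not pinned to the floor
    while free:
        budget = total_n - minimum * (len(sizes) - len(free))
        measure = {state: sizes[state] ** power for state in free}
        scale = budget / sum(measure.values())
        below = {state for state in free if measure[state] * scale < minimum}
        if not below:
            for state in free:
                alloc[state] = measure[state] * scale
            break
        free -= below   # pin these states at the minimum and re-share the rest
    return alloc

# Hypothetical employment counts for five made-up states.
employment = {"A": 12_000_000, "B": 3_000_000, "C": 600_000,
              "D": 150_000, "E": 40_000}
allocation = power_allocation(employment, total_n=4000)
for state, n in sorted(allocation.items()):
    print(state, round(n))
```

Setting power=1.0 and minimum=0 gives an allocation proportional to employment, which serves here as a simple stand-in for Option A, and power=0.0 gives the equal allocation of Option B, so the exponent can be read as a dial between the two extremes.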


3. COMBINING MULTIPLE YEARS 


An inexpensive way to increase the sample sizes in small 
areas is to combine cycles of a repeated survey. Combining 
k years of an annual survey increases the effective sample 
size not quite k times. The reason for this is that usually 
consecutive years of the same survey are conducted in the 
same primary sampling units (PSUs) and even adjacent area 
segments. This results in some correlation between years, 
somewhat reducing the effective sample size. 
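
The size of this loss can be illustrated under a simple assumed model in which the k annual estimates have equal variance and a common pairwise correlation rho. The variance of their average is then (1 + (k - 1) rho) / k times the single-year variance. The correlation values below are hypothetical, and the function name is illustrative.

```python
def effective_years(k, rho):
    """Effective sample-size multiplier from pooling k years whose annual
    estimates share a common pairwise correlation rho (an assumed model)."""
    return k / (1.0 + (k - 1) * rho)

# Hypothetical between-year correlations.
for rho in (0.0, 0.1, 0.3):
    multipliers = [round(effective_years(k, rho), 2) for k in (1, 2, 3)]
    print(f"rho={rho}: multipliers for 1, 2, 3 years = {multipliers}")
```

With rho = 0 the multiplier equals k; any positive correlation pulls it below k, which is the sense in which the gain is "not quite k times".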

One drawback to combining multiple years is that such 
estimates are slow to detect changes across time. If time 
series are a prime interest, alternative methods must be used 
to increase the sample size. 

Table 2 shows for the 1995 NHIS how many states can 
achieve different levels of accuracy by aggregating across 
two or three years. Aggregation clearly helps achieve CVs 
of 30 and 20 percent. Even aggregating 3 years can’t help 
many states achieve a CV of 10 percent. 
Table 2
Summary of the Number of States (out of 51) That Have the Required 1995 NHIS Sample Size to Achieve a
CV of 30-, 20-, and 10-Percent; Aggregating Multiple Years for Four Selected Variables
(44,000 Households, 100,000 Individuals).

                  Percent uninsured:   Percent uninsured:   Percent uninsured:     Percent smokers:
                  all ages             under 19             low income children    18 and over
30-percent CV
  1 year          42                   31                   28                     45
  2 years         46                   35                   36                     50
  3 years         49                   41                   [illegible]            51
20-percent CV
  1 year          31                   13                   10                     36
  2 years         36                   29                   24                     44
  3 years         42                   31                   31                     46
10-percent CV
  1 year           7                    2                    2                     14
  2 years         14                    3                   [missing]              25
  3 years         22                    7                    4                     32


4. HARMONIZATION ACROSS
SURVEYS


Harmonizing questions across surveys is another
inexpensive way to improve estimation. Eurostat has been
making a major effort to harmonize a number of surveys
both between countries and within. The European
Community Household Panel Survey (ECHP) is an attempt
to collect consistent information across the member
countries. Similar standardization is ongoing in each country’s
Labour Force Survey. This harmonization across countries
improves international comparisons.

Harmonizing across surveys of the same population
increases sample sizes, improving small area estimates.
Statistics Finland has been harmonizing the process for
collecting income and other variables in its surveys. The
Permanent Survey on Living Conditions (POLS) at
Statistics Netherlands uses a common procedure for collecting
basic information in a series of social surveys.


Even if the questionnaire wording is consistent across 
surveys, the data may not be completely comparable. 
Different modes of data collection can cause differences, as 
can the placement of questions (Groves 1989). 


5. DUAL-FRAME ESTIMATION 


In some situations it is possible to supplement an 
in-person survey with telephone data collection, thereby 
increasing the sample size in a small area at more limited 
expense. The Dutch Housing Demand Survey is a national 
in-person survey. To produce small area estimates, telephone
supplementation is used in over 100 municipalities.
Table 3 shows the size of the national in-person survey, 
telephone supplement, and total sample in ten selected 
municipalities. 


Table 3
Dual-Frame Completes for Municipalities in the Dutch Housing Demand Survey

Municipality    In-Person National Survey    Telephone Supplement    Total
Leek                        56                       569              625
Marum                       29                       299              328
Slochteren                  44                       456              500
Zuidhorn                    54                       558              612
Emmen                      770                       224              994
Avereest                   134                       465              599
Bathmen                     24                       506              530
Dalfsen                    157                       466              623
Deventer                   316                       335              651
Diepenveen                  47                       336              383
Sirken and Marker (1993) described dual-frame estimation
for the U.S. National Health Interview Survey (NHIS)
based on its 1985-94 design. Table 4 examines the same 
idea for the current design implemented beginning in 1995. 
The table compares the ability to produce state estimates 
with national in-person survey interviews and with unbiased 
dual-frame estimation using an unlimited number of supple- 
mental telephone interviews. (Up to 100, 200, and 2,000 
telephone interviews per state are required to achieve CVs 
of 30-, 20-, and 10- percent, respectively.) When a small 
area has a large percentage of households without tele- 
phones, no amount of telephone supplementation may be 
sufficient to achieve unbiased estimates with the desired 
accuracy. 

In such situations, it may only be possible to achieve a 
desired level of accuracy using a potentially biased esti- 
mator that combines all data regardless of the mode of 
collection. The relative root mean square error (RRMSE) 
must then be used instead of the CV to measure accuracy. 
However, for some characteristics households with 
telephones have different expectations than households
without telephones. In such situations the bias can again 
prevent achieving the desired accuracy. The bias for each of 
these variables was estimated by comparing NHIS 
responses from households with and without telephones. 
Table 5 shows how the number of states for which a 10 
percent RRMSE can be achieved varies by question, a 
function of the bias in telephone households and the 
telephone penetration rate in each state. 

Small areas with high telephone penetration rates, for 
characteristics with different expectations for telephone and 
non-telephone households, are better able to produce 
accurate estimates using an unbiased dual-frame estimator. 
Small areas with lower penetration rates, for characteristics 
with similar telephone and non-telephone households, 
produce more accurate estimates with a potentially biased 
dual-frame estimator. Using the appropriate dual-frame 
estimator for a given small area and characteristic can allow 
accurate estimates to be produced for a large percentage of 
small areas. 
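
The choice between the two estimators comes down to comparing relative root mean square errors, where RRMSE = sqrt(variance + bias²) / estimate and reduces to the CV when the bias is zero. The sketch below shows that comparison for one small area and one characteristic. The function names and all numerical inputs are hypothetical; they are not taken from the NHIS analysis.

```python
import math

def rrmse(estimate, variance, bias=0.0):
    """Relative root mean square error; equals the CV when bias is zero."""
    return math.sqrt(variance + bias ** 2) / estimate

def pick_estimator(estimate, var_unbiased, var_biased, bias):
    """Return the label and RRMSE of the more accurate of the two candidates."""
    candidates = {
        "unbiased": rrmse(estimate, var_unbiased),
        "biased": rrmse(estimate, var_biased, bias),
    }
    label = min(candidates, key=candidates.get)
    return label, candidates[label]

# Hypothetical inputs: the pooled (biased) estimator has the smaller variance.
small_bias = pick_estimator(0.135, var_unbiased=0.0004, var_biased=0.0001, bias=0.005)
large_bias = pick_estimator(0.135, var_unbiased=0.0004, var_biased=0.0001, bias=0.03)
print(small_bias, large_bias)
```

With these made-up inputs the biased estimator is preferred when telephone and non-telephone households are similar (small bias), and the unbiased estimator is preferred when they differ markedly (large bias).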


Table 4
The Number of States Able to Achieve 30-, 20-, 10-Percent CV With the 1995 NHIS Area Sample Only, With Unbiased
Dual-Frame Estimation Using a RDD Supplement, or not at All, for Four Specific Variables

CV    Data sources                  Percent uninsured:   Percent uninsured:   Percent uninsured:     Percent smokers:
                                    all ages             under 19             low income children    18 and over
30%   With area sample only         42                   31                   31                     46
      With RDD supplement            9                   20                   19                      5
      Unable to meet requirement     0                    0                    1                      0
20%   With area sample only         32                   15                   10                     37
      With RDD supplement           19                   35                   40                     14
      Unable to meet requirement     0                    1                    1                      0
10%   With area sample only          8                    2                    2                     15
      With RDD supplement           40                   41                   39                     36
      Unable to meet requirement     3                    8                   10                      0

Table 5
The Number of States Able to Achieve 10-Percent RRMSE With the 1995 NHIS Area Sample Only, With a RDD Supplement,
or not at all, for the Four Specific Variables

Data source                     Percent uninsured:   Percent uninsured:   Percent uninsured:     Percent smokers:
                                all ages             under 19             low income children    18 and over
With area sample only            8                    2                    2                     15
With RDD supplement
  Unbiased Estimator            40                   41                   39                     36
  Biased Estimator              30                   47                   49                     35
Unable to meet requirement
  Unbiased Estimator             3                    8                   10                      0
  Biased Estimator              13                    2                    0                      1
6. IMPROVING POINT AND VARIANCE
ESTIMATION 


When sufficient sample size exists to produce small area 
estimates there are additional steps that can be taken to 
improve their accuracy. SIPP does not stratify by state; to
improve state estimates, it reweights the estimates to control
totals at the state level. This is very important when the 
stratification doesn’t match the analytic domains. The use 
of control totals also improves subpopulation (e.g., demo- 
graphic) size estimates for the small areas. However, it is 
not possible to control as many subpopulations in a small 
area as can be done at the national level, due to the smaller 
sample sizes. 
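
Reweighting to state control totals can be illustrated with a simple ratio adjustment, in which every weight in a state is scaled by the ratio of the known state total to the weighted sample total. The sketch below is a generic poststratification example with hypothetical records and control totals; it is not a description of the actual SIPP weighting procedure.

```python
from collections import defaultdict

def ratio_adjust(records, control_totals):
    """Scale weights so that they sum to the control total within each state.

    records: list of dicts with 'state' and 'weight' keys.
    control_totals: dict mapping state to its known population total.
    """
    weighted = defaultdict(float)
    for r in records:
        weighted[r["state"]] += r["weight"]
    return [
        {**r, "weight": r["weight"] * control_totals[r["state"]] / weighted[r["state"]]}
        for r in records
    ]

# Hypothetical sample records and control totals.
records = [
    {"id": 1, "state": "X", "weight": 100.0},
    {"id": 2, "state": "X", "weight": 300.0},
    {"id": 3, "state": "Y", "weight": 250.0},
]
controls = {"X": 500.0, "Y": 200.0}
adjusted = ratio_adjust(records, controls)
for r in adjusted:
    print(r["id"], r["state"], r["weight"])
```

The same function could be applied to finer cells (for example state by age group) by changing the key, but as noted above the number of cells that can be controlled within a small area is limited by its sample size.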

There are also many techniques to improve variance 
estimation for small areas. Typically there will be very few 
sampled PSUs in a given small area. This provides few 
degrees of freedom for estimating between-PSU (or total) 
variance. One solution is to average estimates of variance 
across small areas, but this covers up the fact that estimates 
are generally much better for some areas than for others. 
Alternatively generalized variance functions (GVFs) can be 
used to smooth variance estimates. 

A preferable solution is to address small area variance 
estimation at the design stage. Increasing the number of 
PSUs, with a corresponding reduction in sample size in 
each PSU, can significantly improve both point and 
variance estimation, often at little extra cost. Singh et al. 
(1994) suggested increasing the number of PSUs to control 
sample sizes in unplanned small areas. Remembering 
Fuller’s observation that “The client will always require 
more than is specified at the design stage,” it is impossible 
to anticipate all small areas of interest. By having more 
PSUs the likelihood is increased that actual data will have 
been collected from unplanned analytic domains. 

Kalton (1994) suggested a second reason for increasing 
the number of PSUs. His point was that more PSUs per
small area would greatly increase the stability of variance
estimates. This is true even in very large national surveys
with many PSUs. The NHIS was redesigned in 1995,
increasing the number of PSUs from 196 to 359. Of these
359 PSUs, 264 were noncertainty PSUs. This still resulted
in only 7 states having more than 8 noncertainty PSUs. 
While direct variance estimation for individual states is still 
problematic for most states, there is an increased opportu- 
nity to develop average variance estimates for groups of 
states with common characteristics, rather than having to 
group all states together in a national average. 


7. SUMMARY 


There will always be a need for indirect small area 
estimation methods since the entire set of analytic domains 
is never known in advance. This need for small area 
estimates is growing around the world. There are, however, 
many actions that can be taken at the design stage to 
improve direct small area estimates, both point estimates 
and variance estimates. These steps include stratification 
consistent with known analytic domains, oversampling 
smaller areas, and increasing the number of PSUs. Once data
have been collected, it is often possible to combine data from
multiple years or from other surveys with which questions have
been harmonized, or to use dual-frame estimation techniques.
These steps will both reduce the need for indirect estimates 
and improve the accuracy of those estimates when they are 
required. 
A Repeated Half-Sample Bootstrap and Balanced Repeated Replications
for Randomly Imputed Data 


HIROSHI SAIGO, JUN SHAO and RANDY R. SITTER¹


ABSTRACT 


In this paper, we discuss the application of the bootstrap with a re-imputation step to capture the imputation variance (Shao 
and Sitter 1996) in stratified multistage sampling. We propose a modified bootstrap that does not require rescaling so that 
Shao and Sitter’s procedure can be applied to the case where random imputation is applied and the first-stage stratum sample 
sizes are very small. This provides a unified method that works irrespective of the imputation method (random or 
nonrandom), the stratum size (small or large), the type of estimator (smooth or nonsmooth), or the type of problem (variance 
estimation or sampling distribution estimation). In addition, we discuss the proper Monte Carlo approximation to the 
bootstrap variance, when using re-imputation together with resampling methods. In this setting, more care is needed than 
is typical. Similar results are obtained for the method of balanced repeated replications, which is often used in surveys and 
can be viewed as an analytic approximation to the bootstrap. Finally, some simulation results are presented to study finite 
sample properties and various variance estimators for imputed data. 


KEY WORDS: Hotdeck; Percentile method; Monte Carlo; Imputation; Bootstrap sample size. 


1. INTRODUCTION 


Item nonresponse is a common occurrence in surveys 
and is usually handled by imputing missing item values. 
The various imputation methods used in practice can be 
classified into two types: deterministic imputation, such as 
mean, ratio and regression imputation, typically using the 
respondents and some auxiliary data observed on all 
sampled elements; and random imputation. In both cases 
the imputation is often applied within imputation classes 
formed on the basis of auxiliary variables. This article 
focuses on random imputation. 

Typically, random imputation is done in such a way that 
applying the usual estimation formulas to the imputed data 
set produces asymptotically unbiased and consistent survey 
estimators (e.g., means, totals, quantiles). More details 
about random imputation are provided in section 2. It is 
common practice to also treat the imputed values as true 
values when estimating variances of survey estimators. This 
leads to serious underestimation of variances if the pro- 
portion of missing data is appreciable, and to poor confi- 
dence intervals. 

There have been some proposals in the literature to 
circumvent this difficulty. For random imputation, Rubin 
(1978) and Rubin and Schenker (1986) proposed the 
multiple imputation method to account for the inflation in 
the variance, which can be justified from a Bayesian per- 
spective (Rubin 1987). Adjusted jackknife methods for 
variance estimation have been proposed for both random 
and deterministic imputations (Rao and Shao 1992; Rao 
1993; Rao and Sitter 1995; Sitter 1997), under stratified 
multistage sampling. However, it is well known that the
jackknife cannot be applied to non-smooth estimators, e.g.,
a sample quantile or an estimated low income proportion 
(Mantel and Singh 1991). 

There are two methods available for handling randomly 
imputed data for both smooth and non-smooth estimators: 
the adjusted balanced repeated replication (BRR) methods 
proposed by Shao, Chen and Chen (1998); and the boot- 
strap method proposed by Shao and Sitter (1996) (see also 
Efron 1994) with a re-imputation step to capture the impu- 
tation variance. The bootstrap method is more computer 
intensive but is easy to motivate and understand, and 
provides a unified method that works irrespective of the 
imputation method (random or nonrandom), the type of $\hat{\theta}$
(smooth or nonsmooth), or the type of problem (variance 
estimation or sampling distribution estimation). 

In this article we continue the work by Shao and Sitter 
(1996). First, we show in section 3 how Shao and Sitter’s 
bootstrap procedure can be modified to handle very small 
stratum sizes (e.g., two psu’s per stratum). Second, we 
discuss in section 4 the proper Monte Carlo approximation 
to the bootstrap estimators, a problem for which more care 
is needed when random re-imputation is applied than is 
typical. This has no detrimental effect on bootstrap confi- 
dence intervals based on the percentile method, but if done 
incorrectly, will cause the bootstrap-t to perform poorly. 
Third, we consider a BRR variance estimation method with 
a re-imputation step, which can be viewed as an analytic 
and symmetric approximation to the bootstrap method. 
Finally, we present some simulation results to study
properties of various bootstrap and BRR variance
estimators.


2. STRATIFIED MULTISTAGE SAMPLING AND
RANDOM IMPUTATION


Though the methods discussed in this article can be more
generally applied, we restrict attention to the commonly
used stratified multistage sampling design. Suppose that the
population contains $H$ strata and in stratum $h$, $n_h$ clusters
are selected with probabilities $p_{hi}$, $i = 1, \ldots, n_h$. Samples are
taken independently across strata. In the case of complete
response on item $y$, let

$$\hat Y_h = \sum_{i=1}^{n_h} \hat Y_{hi}/(n_h p_{hi})$$

be a linear unbiased estimator of the stratum total $Y_h$, where
$\hat Y_{hi}$ is a linear unbiased estimator of the cluster total $Y_{hi}$ for
a selected cluster based on sampling at the second and
subsequent stages. A linear unbiased estimator of the total,
$Y = \sum_h Y_h$, is given by $\hat Y = \sum_h \hat Y_h$, which may be written as

$$\hat Y = \sum_{(hik) \in s} w_{hik}\, y_{hik}, \tag{1}$$

where $s$ is the complete sample of elements, and $w_{hik}$ and
$y_{hik}$ respectively denote the sampling weight and the item
value attached to the $(hik)$-th sampled element.

Often a survey estimator, $\hat\theta$, can be expressed as a
function of a vector of estimated totals as in (1). If one is
interested in the population distribution function, it can be
estimated by $\hat F(t) = \sum_{(hik) \in s} w_{hik}\, I(y_{hik} \le t)/\hat U$, where $I(\cdot)$ is
the usual indicator function and $\hat U = \sum_{(hik) \in s} w_{hik}$. Some
non-smooth estimators that are of interest are the $p$-th sample
quantile, $\hat F^{-1}(p)$, where $\hat F^{-1}$ is the quantile function of $\hat F$,
and the sample low income proportion $\hat F[(1/2)\hat F^{-1}(1/2)]$.
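
The three quantities just described can be illustrated with a minimal sketch. It assumes the sample is held as two parallel lists of item values and survey weights and that the quantile is taken as the smallest observed value at which the estimated distribution function reaches $p$; the function names are hypothetical and are not taken from the article.

```python
def weighted_cdf(values, weights, t):
    """Weighted share of sampled elements with y <= t (the estimate of F(t))."""
    total_weight = sum(weights)
    covered = sum(w for y, w in zip(values, weights) if y <= t)
    return covered / total_weight


def weighted_quantile(values, weights, p):
    """Smallest observed y at which the estimated F reaches at least p."""
    total_weight = sum(weights)
    running = 0.0
    for y, w in sorted(zip(values, weights)):
        running += w
        if running / total_weight >= p:
            return y
    return max(values)


def low_income_proportion(values, weights):
    """Estimated share of elements at or below half of the estimated median."""
    median = weighted_quantile(values, weights, 0.5)
    return weighted_cdf(values, weights, 0.5 * median)
```

Because the quantile and the low income proportion are step functions of the data, they are the kind of non-smooth estimators for which the jackknife is not appropriate.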

Suppose that the value $y_{hik}$ is observed for $(hik) \in s_r \subset s$,
termed a respondent, while for others, $(hik) \in s_m$, it is
missing, termed a nonrespondent, with $s = s_r \cup s_m$. When
there are missing data, it is common practice to use
$\{y_{hik}: (hik) \in s_r\}$ to obtain imputed values $\tilde y_{hik}$ for
$(hik) \in s_m$, and then treat these imputed values as if they
were true observations and estimate $Y$ with

$$\hat Y_I = \sum_{s_r} w_{hik}\, y_{hik} + \sum_{s_m} w_{hik}\, \tilde y_{hik}. \tag{2}$$

In practice, the accuracy of the imputation is improved 
by first forming several imputation classes using control 
variables observed on the entire sample, and then imputing 
within imputation class. For simplicity we consider a single 
imputation class. 
Random imputation entails imputing the missing data by
a random sample from the respondents, or, in the presence
of auxiliary data, by using a random sample of residuals. If
the imputation is suitably done, the estimator $\hat Y_I$ in (2) is
asymptotically unbiased and consistent, although it is not as
efficient as $\hat Y$ in (1). Throughout this article, we assume
that either


within each imputation cell, the response probability
for a given variable is a constant, the response statuses
for different units are independent, and imputation is
carried out within each imputation cell and independently
across the imputation cells,

or 


within each imputation cell, the response probability of 
a given variable does not depend on the variable itself 
(but may depend on the covariates used for imputa- 
tion), imputation is carried out independently across the 
imputation cells, and within an imputation cell, impu- 
tation is performed according to a model that relates the 
variable being imputed to the covariates used for 
imputation. 


We also assume the same asymptotic setting as that in
Shao et al. (1998). Thus, consistency (or asymptotic
unbiasedness) refers to convergence of estimators (or
expectations of estimators) under the assumption in Shao
et al. (1998), as the first-stage sample size $n = \sum_h n_h$
increases to infinity.

There are many methods of random imputation. We 
consider only two in this article: the weighted hotdeck 
considered in Rao and Shao (1992), which we refer to 
simply as random imputation, and the adjusted weighted 
hotdeck proposed in Chen, Rao and Sitter (2000), which we 
refer to as adjusted random imputation. Our results can be 
easily extended to random imputation with residuals in the 
presence of auxiliary data (e.g., random regression impu- 
tation). Generalizations to other types of random imputation 
may be possible, but will not be considered here. 

Random imputation randomly selects donors, $\tilde y_{hik}$, from
$\{y_{hik}: (hik) \in s_r\}$ with replacement with probabilities
$w_{hik}/\hat T$, where $\hat T = \sum_{s_r} w_{hik}$. In this case
$E_I(\hat Y_I) = \hat S\hat U/\hat T$, a ratio estimator which is asymptotically
unbiased and consistent for $Y$, where $\hat S = \sum_{s_r} w_{hik}\, y_{hik}$
and $\hat U = \sum_s w_{hik}$. Here $E_I$
denotes expectation under the random imputation. The
variance of $\hat Y_I$ is larger than the variance of this ratio
estimator because of the random imputation. However, the
distribution of item values in the imputed data set is preserved.
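
As an illustration of the weighted hotdeck within a single imputation class, the following sketch draws donors with probability proportional to the respondents' survey weights and then forms the imputed total in (2). The data values, the seed and the function names are made up for the example and are not from the article.

```python
import random


def weighted_hot_deck(resp_values, resp_weights, n_missing, rng):
    """Draw donors with replacement, each respondent having selection
    probability w / T-hat, where T-hat is the sum of respondent weights."""
    return rng.choices(resp_values, weights=resp_weights, k=n_missing)


def imputed_total(resp_values, resp_weights, imputed_values, miss_weights):
    """Estimator (2): weighted respondent values plus weighted imputed values."""
    observed = sum(w * y for w, y in zip(resp_weights, resp_values))
    imputed = sum(w * y for w, y in zip(miss_weights, imputed_values))
    return observed + imputed


rng = random.Random(2001)          # fixed seed so the sketch is reproducible
resp_values = [12.0, 15.0, 9.0]    # illustrative respondent data
resp_weights = [2.0, 1.0, 3.0]
miss_weights = [2.0, 2.0]          # weights of the two nonrespondents
donors = weighted_hot_deck(resp_values, resp_weights, len(miss_weights), rng)
total = imputed_total(resp_values, resp_weights, donors, miss_weights)
```

Every imputed value is an actual respondent value, which is why the distribution of item values is preserved, while the total changes from one run of the imputation to the next.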

Adjusted random imputation simply uses $\tilde\eta_{hik} =
\tilde y_{hik} + (\hat S/\hat T - \tilde S/\tilde T)$ as the imputed values instead of $\tilde y_{hik}$,
where $\tilde S = \sum_{s_m} w_{hik}\, \tilde y_{hik}$, $\tilde T = \sum_{s_m} w_{hik}$ and $\tilde y_{hik}$ are the
imputed values from random imputation. Chen et al. (2000)
show that this method completely eliminates the variability
due to the random imputation for estimating the population
total. That is, $\sum_{s_r} w_{hik}\, y_{hik} + \sum_{s_m} w_{hik}\, \tilde\eta_{hik} = \hat S(\hat T + \tilde T)/\hat T$,
which does not depend on the donors selected. The
method also preserves the distribution of item values in the
imputed data set. However, the resulting imputed values
need not be actual realizations.
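
The effect of the adjustment on the total can be checked numerically. The sketch below shifts a set of hotdeck-imputed values by $\hat S/\hat T - \tilde S/\tilde T$; the function name and the numbers are illustrative assumptions only.

```python
def adjusted_imputed_values(resp_values, resp_weights, imputed_values, miss_weights):
    """Shift randomly imputed values so that the imputed total no longer
    depends on which donors happened to be drawn."""
    s_hat = sum(w * y for w, y in zip(resp_weights, resp_values))
    t_hat = sum(resp_weights)
    s_tilde = sum(w * y for w, y in zip(miss_weights, imputed_values))
    t_tilde = sum(miss_weights)
    shift = s_hat / t_hat - s_tilde / t_tilde
    return [y + shift for y in imputed_values]


resp_values = [12.0, 15.0, 9.0]
resp_weights = [2.0, 1.0, 3.0]
miss_weights = [2.0, 2.0]
drawn = [12.0, 9.0]                      # one possible hotdeck outcome
adjusted = adjusted_imputed_values(resp_values, resp_weights, drawn, miss_weights)
adjusted_total = (sum(w * y for w, y in zip(resp_weights, resp_values))
                  + sum(w * y for w, y in zip(miss_weights, adjusted)))
```

Whatever donors are drawn, `adjusted_total` equals the ratio estimator, here 66 × 10 / 6 = 110; the shifted values 12.5 and 9.5 are not values any respondent reported.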

An imputed estimator of the distribution function under
random imputation is given by

$$\hat F_I(t) = \Big[\sum_{s_r} w_{hik}\, I(y_{hik} \le t) + \sum_{s_m} w_{hik}\, I(\tilde y_{hik} \le t)\Big]\Big/\hat U. \tag{3}$$

An imputed estimator of the distribution function under
adjusted random imputation, denoted $\hat F_A(t)$, is simply
obtained by replacing $\tilde y_{hik}$ in (3) by $\tilde\eta_{hik}$. For estimating the
distribution function, adjusted random imputation does not
eliminate the imputation variance as it does for estimating
the total. However, Chen et al. (2000) show that it does
significantly reduce the imputation variance when
compared to random imputation. Both $\hat F_I(t)$ and $\hat F_A(t)$ are
asymptotically unbiased and consistent.

For studying variance estimation with resampling
methods, we assume that $n/N$ is negligible, where
$n = \sum_h n_h$, $N = \sum_h N_h$ and $N_h$ is the number of first-stage
clusters in the population.


3. A REPEATED HALF-SAMPLE BOOTSTRAP 


When there are imputed missing data, naive bootstrap
variance estimators obtained by treating the imputed data
set, $Y_I$, as $Y = \{y_{hik}: (hik) \in s\}$, the data set with no missing
values, do not capture the inflation in variance due to
imputation and/or missing data and lead to serious
underestimation. As a result, they are inconsistent. This is so
because simply treating $Y_I$ as $Y$ ignores the imputation
process. This was noted by Shao and Sitter (1996), who
proposed re-imputing the bootstrap data set in the same way
as the original data set was imputed. The bootstrap
procedure in Shao and Sitter (1996) can be described as follows.


1. Draw a simple random sample $\{y^*_{hi}: i = 1, \ldots, n_h - 1\}$
with replacement from the sample $\{\tilde y_{hi}: i = 1, \ldots, n_h\}$,
$h = 1, \ldots, H$, independently across the strata, where
$\tilde y_{hi} = \{y_{hij}: (h,i,j) \in s_r\} \cup \{\tilde y_{hij}: (h,i,j) \in s_m\}$.

2. Let $a^*_{hij}$ be the response indicator associated with
$y^*_{hij}$, $s^*_m = \{(h,i,j): a^*_{hij} = 0\}$ and $s^*_r =
\{(h,i,j): a^*_{hij} = 1\}$. Apply the same imputation
procedure used in constructing the imputed data set $Y_I$ to the
units in $s^*_m$, using the "respondents" in
$s^*_r$. Denote the bootstrap analogue of $Y_I$ by $Y^*_I$.


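3. Obtain the bootstrap analogue $\hat\theta^*_I$ of $\hat\theta_I$, based on the
imputed bootstrap data set $Y^*_I$. For example, if $\hat\theta = \hat Y$ in
(1) and $\hat\theta_I = \hat Y_I$ in (2), then

$$\hat\theta^*_I = \sum_{s^*_r} w^*_{hik}\, y^*_{hik} + \sum_{s^*_m} w^*_{hik}\, \tilde y^*_{hik}, \tag{4}$$

where $\tilde y^*_{hik}$ is the imputed value using the bootstrap data
and $w^*_{hik}$ is $n_h/(n_h - 1)$ times the survey weight associated
with $y^*_{hik}$ (to reflect the fact that the bootstrap sample size
is $n_h - 1$, not $n_h$). The bootstrap estimator of $\mathrm{Var}(\hat\theta_I)$ is

$$v_B(\hat\theta_I) = \mathrm{Var}_*(\hat\theta^*_I), \tag{5}$$

where $\mathrm{Var}_*$ is the conditional variance with respect to $Y^*_I$,
given $Y_I$.

Shao and Sitter (1996) show that the bootstrap estimator
defined in (5) is consistent for both smooth and nonsmooth
estimators $\hat\theta$. When a random imputation method is
considered, an implicit condition in their development is that
$n_h/(n_h - 1)$ goes to 1. This can be seen from the special
case of $\hat\theta = \hat Y$. From (2),

$$\mathrm{Var}(\hat Y_I) = \mathrm{Var}[E_I(\hat Y_I)] + E[\mathrm{Var}_I(\hat Y_I)]
= \mathrm{Var}\Big(\frac{\sum_{s_r} w_{hik}\, y_{hik}}{\sum_{s_r} w_{hik}} \sum_{s} w_{hik}\Big)
+ E\Big(\hat\sigma_r^2 \sum_{s_m} w_{hik}^2\Big), \tag{6}$$

where

$$\hat\sigma_r^2 = \sum_{s_r} w_{hik}(y_{hik} - \bar y_r)^2 \Big/ \sum_{s_r} w_{hik},
\qquad \bar y_r = \sum_{s_r} w_{hik}\, y_{hik} \Big/ \sum_{s_r} w_{hik}.$$

Similarly, by (4),

$$\mathrm{Var}_*(\hat\theta^*_I) = \mathrm{Var}_*\Big(\frac{\sum_{s^*_r} w^*_{hik}\, y^*_{hik}}{\sum_{s^*_r} w^*_{hik}} \sum_{s^*} w^*_{hik}\Big)
+ E_*\Big(\hat\sigma_r^{*2} \sum_{s^*_m} w_{hik}^{*2}\Big), \tag{7}$$

where

$$\hat\sigma_r^{*2} = \sum_{s^*_r} w^*_{hik}(y^*_{hik} - \bar y^*_r)^2 \Big/ \sum_{s^*_r} w^*_{hik},
\qquad \bar y^*_r = \sum_{s^*_r} w^*_{hik}\, y^*_{hik} \Big/ \sum_{s^*_r} w^*_{hik}.$$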

From the theory of the bootstrap, the first terms on the right
hand side of (6) and (7) converge to the same quantity, as
do $\hat\sigma_r^2$ and $\hat\sigma_r^{*2}$. Thus, Shao and Sitter's bootstrap is
consistent if $\sum_{s_m} w_{hik}^2$ and $\sum_{s^*_m} w_{hik}^{*2}$ converge to the same
quantity, which is true if $n_h/(n_h - 1)$ converges to 1 for all
$h$, because

$$E_*\Big(\sum_{s^*_m} w_{hik}^{*2}\Big) = \sum_{s_m} w_{hik}^2\, n_h/(n_h - 1).$$

The second term on the right hand side of (6) is the variance
component corresponding to random imputation, which is
typically a small portion of the overall variance. Thus, the
overestimation due to $n_h/(n_h - 1)$ is serious only when the
$n_h$'s are very small. The case $n_h = 2$ is, however, an
important special case.
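
The re-imputation bootstrap described in Steps 1 to 3 can be sketched as follows for the estimated total. The sketch makes several simplifying assumptions that are not in the article: clusters are fully enumerated, there is a single imputation class, nonrespondents are marked with `None`, every resample contains at least one respondent, and the variance is approximated by Monte Carlo around the replicate mean. All names are hypothetical.

```python
import random
import statistics


def reimpute_and_total(units, rng):
    """Weighted hot deck over (weight, y) pairs, with y=None marking a
    nonrespondent, followed by the weighted total of the completed data."""
    donors = [(w, y) for w, y in units if y is not None]
    donor_values = [y for _, y in donors]
    donor_weights = [w for w, _ in donors]
    total = 0.0
    for w, y in units:
        if y is None:
            y = rng.choices(donor_values, weights=donor_weights, k=1)[0]
        total += w * y
    return total


def bootstrap_replicate(strata, rng):
    """Steps 1 to 3 for one replicate. strata[h] is a list of clusters and
    each cluster is a list of (weight, y_or_None) pairs."""
    units = []
    for clusters in strata:
        n_h = len(clusters)
        scale = n_h / (n_h - 1)              # weight adjustment n_h/(n_h - 1)
        for _ in range(n_h - 1):             # n_h - 1 draws with replacement
            cluster = rng.choice(clusters)
            # Response status travels with the unit; old imputed values do not.
            units.extend((scale * w, y) for w, y in cluster)
    return reimpute_and_total(units, rng)


def bootstrap_variance(strata, n_boot, rng):
    """Monte Carlo variance of the re-imputed replicate totals."""
    replicates = [bootstrap_replicate(strata, rng) for _ in range(n_boot)]
    return statistics.pvariance(replicates)   # centred at the replicate mean
```

With $n_h = 2$ the factor `scale` equals 2, so the squared weights of the resampled nonrespondents are inflated in exactly the way that makes the imputation component of the variance too large.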

We now propose a bootstrap method which has no
difficulty in the case of very small $n_h$'s while remaining
valid more generally. Note that the use of bootstrap sample
size $n_h - 1$ is to ensure that the first term on the right hand
side of (7) has the same limit as the first term on the right
hand side of (6) (Rao and Wu 1988). When $n_h$ is used as
the bootstrap sample size in stratum $h$, Rao and Wu (1988)
showed that in the case of no missing data, the bootstrap
variance estimator underestimates. They proposed a
rescaling to circumvent the problem, but rescaling does not
produce correct bootstrap estimators in the presence of
imputed data.

What is ideally required for our problem is a bootstrap
method with the bootstrap sample size equal to the original
sample size $n_h$ which produces an asymptotically unbiased
variance estimator (in the case of no missing data) without
rescaling. We now show that this can be accomplished as
follows. Suppose that there is no missing data and that all
of the $n_h = 2m_h$'s are even. Take a simple random sample
of size $m_h$ without replacement independently from
$\{y_{hi}: i = 1, \ldots, n_h\}$ and repeat each obtained unit twice to
get $\{y^*_{hi}: i = 1, \ldots, n_h\}$. We call this method the repeated
half-sample bootstrap. The resulting $v_B$ will then be
approximately unbiased and consistent. In the linear case
where $\hat Y = \sum_h \bar y_h$, $\bar y_h = n_h^{-1}\sum_{i=1}^{n_h} y_{hi}$ and
$y_{hi} = \hat Y_{hi}/p_{hi}$, the consistency of $v_B$ follows from

$$\mathrm{Var}_*(\hat Y^*) = \sum_h \mathrm{Var}_*\Big(\frac{1}{n_h}\sum_{i=1}^{n_h} y^*_{hi}\Big) = \sum_h \frac{s_h^2}{n_h},$$

the usual approximately unbiased and consistent estimator
of variance, where $s_h^2 = (n_h - 1)^{-1}\sum_{i=1}^{n_h}(y_{hi} - \bar y_h)^2$. The
consistency of $v_B$ for a nonlinear $\hat\theta$ follows from the linear
case and Taylor's expansion, when $\hat\theta$ is a function of
weighted averages, or the arguments used in Shao and Rao
(1994), Shao and Sitter (1996), and Shao et al. (1998) when
$\hat\theta$ is non-smooth such as a median.

If $n_h = 2m_h + 1$ is odd, it is not possible to take an exact
half-sample. In this case, the following two results lead us
to an adaptation of the above idea:

i) If we choose a simple random resample of size
$m_h = (n_h - 1)/2$ without replacement and repeat each
unit twice, we end up with $n_h - 1$ units. If we obtain an
additional unit by selecting one at random from the
$n_h - 1$ units already resampled, $\mathrm{Var}_*(\hat Y^*) =
\sum_h (n_h + 3)s_h^2/n_h^2$;

ii) If we choose a simple random resample of size $m_h + 1$
without replacement and repeat each unit twice, we end
up with $n_h + 1$ units. If we discard one of these at
random, $\mathrm{Var}_*(\hat Y^*) = \sum_h (n_h - 1)s_h^2/n_h^2$.

Thus, if we use method (i) with probability 1/4 and
method (ii) with probability 3/4 at each bootstrap
replication, we obtain the desired result, since
$\frac{1}{4}(n_h + 3) + \frac{3}{4}(n_h - 1) = n_h$ gives
$\mathrm{Var}_*(\hat Y^*) = \sum_h s_h^2/n_h$. This repeated
half-sample bootstrap method yields approximately unbiased
variance estimates without rescaling and has a bootstrap
sample size equal to the original sample size. Thus, if we
use this bootstrap for Step 1 of the method of Shao and
Sitter (1996) as described above, the resulting bootstrap
estimators are asymptotically unbiased and consistent for
any $n_h$, under the regularity conditions stated in Shao and
Sitter (1996) and Shao et al. (1998).
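
A minimal sketch of the resampling step for one stratum is given below, covering both the even case and the 1/4 : 3/4 mixture for odd $n_h$. It returns cluster indices rather than data, assumes $n_h \ge 2$, and the function names are illustrative. The second function enumerates every half-sample of a small even stratum so that the identity $\mathrm{Var}_*(\bar y^*_h) = s_h^2/n_h$ can be checked exactly.

```python
import itertools
import random
import statistics


def repeated_half_sample(n_h, rng):
    """Return n_h resampled cluster indices for one stratum (n_h >= 2)."""
    m = n_h // 2
    if n_h % 2 == 0:
        half = rng.sample(range(n_h), m)          # without replacement
        return [i for i in half for _ in range(2)]
    if rng.random() < 0.25:
        # Method (i): m units twice, plus one extra copy of a resampled unit.
        half = rng.sample(range(n_h), m)
        resample = [i for i in half for _ in range(2)]
        resample.append(rng.choice(half))
    else:
        # Method (ii): m + 1 units twice, then discard one copy at random.
        half = rng.sample(range(n_h), m + 1)
        resample = [i for i in half for _ in range(2)]
        resample.remove(rng.choice(half))
    return resample


def exact_half_sample_variance(y):
    """Variance of the resample mean over all half-samples (even n only)."""
    n = len(y)
    m = n // 2
    means = [sum(y[i] for i in half) / m
             for half in itertools.combinations(range(n), m)]
    return statistics.pvariance(means)
```

For example, with `y = [1.0, 4.0, 6.0, 9.0]` the six half-sample means have variance 17/6, which equals `statistics.variance(y) / 4`.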


4. THE PROPER MONTE CARLO FOR THE 
BOOTSTRAP 


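If $v_B$ in (5) has no explicit form, one may use the Monte
Carlo approximation

$$v_B(\hat\theta_I) \approx \frac{1}{B}\sum_{b=1}^{B}\big(\hat\theta_I^{*(b)} - \bar\theta^*_I\big)^2, \tag{8}$$

where $\bar\theta^*_I = B^{-1}\sum_{b=1}^{B}\hat\theta_I^{*(b)}$, $\hat\theta_I^{*(b)} = \hat\theta(Y_I^{*(b)})$ and
$Y_I^{*(b)}$, $b = 1, \ldots, B$, are independent re-imputed bootstrap
data sets. It is common practice in many applications of the
bootstrap to replace the average of the bootstrap estimators, $\bar\theta^*_I$,
in (8) by the original estimator $\hat\theta_I$ (see Rao and Wu 1985,
page 232). The latter is simpler to use and is thus the most
common. With no imputed data, this is usually correct.
However, using the analogue with the re-imputed bootstrap
is not correct. The reason is that $\hat\theta_I$ is the result of a single
realization of the random imputation, while $\bar\theta^*_I =
B^{-1}\sum_b \hat\theta_I^{*(b)} \approx E_*(\hat\theta^*_I)$ since we are averaging over repeated
re-imputations, and $\hat\theta_I$ and $E_*(\hat\theta^*_I)$ are not close for random
imputation. When $\hat\theta_I = \hat Y_I$, for example, $E_*(\hat\theta^*_I)$ is approximately
the ratio estimator $\hat S\hat U/\hat T$ given in section 2, and the difference
between $\hat Y_I$ and this ratio estimator is not a relatively
negligible term when random imputation is used. Thus,

$$\frac{1}{B}\sum_{b=1}^{B}\big(\hat\theta_I^{*(b)} - \hat\theta_I\big)^2
= \frac{1}{B}\sum_{b=1}^{B}\big(\hat\theta_I^{*(b)} - \bar\theta^*_I\big)^2 + \big(\bar\theta^*_I - \hat\theta_I\big)^2,$$

and the first term goes to $\mathrm{Var}_*(\hat\theta^*_I)$ as $B \to \infty$ but the second
term does not go to zero, which implies that the variance
estimator centered at $\hat\theta_I$ badly overestimates the variance.
This is not only true for the
proposed repeated half-sample bootstrap but also for those
considered in Shao and Sitter (1996).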
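
The decomposition behind this warning is easy to verify numerically. The sketch below takes a list of replicate estimates and the original imputed estimate, both made-up numbers, and returns the properly centred variance, the improperly centred one, and the squared gap between the replicate mean and the original estimate. The divisor $B$ is an assumption of the sketch.

```python
def monte_carlo_variances(replicates, original_estimate):
    """Return (proper, improper, squared_gap) for a set of replicate estimates.

    proper   : average squared deviation from the replicate mean
    improper : average squared deviation from the original estimate
    """
    b = len(replicates)
    mean_rep = sum(replicates) / b
    proper = sum((r - mean_rep) ** 2 for r in replicates) / b
    improper = sum((r - original_estimate) ** 2 for r in replicates) / b
    squared_gap = (mean_rep - original_estimate) ** 2
    return proper, improper, squared_gap


proper, improper, gap = monte_carlo_variances([9.0, 10.0, 11.0, 10.0], 12.0)
```

Here `proper` is 0.5 and `improper` is 4.5; the difference of 4.0 is exactly the squared gap, which does not shrink as more replicates are drawn.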

One should also note that using the $\hat\theta_I^{*(b)}$, $b = 1, \ldots, B$, to
obtain bootstrap confidence intervals via the percentile
method avoids this concern since the histogram of these
values will be correctly centered about $E_*(\hat\theta^*_I)$. However,
one must take more care with bootstrap-t confidence
intervals. It is important that one define
$t^*_b = (\hat\theta_I^{*(b)} - \bar\theta^*_I)/\hat\sigma^*_I$ (not $t^*_b = (\hat\theta_I^{*(b)} - \hat\theta_I)/\hat\sigma^*_I$) and use
$[\hat\theta_I - t_U\hat\sigma_I,\ \hat\theta_I - t_L\hat\sigma_I]$, where $\hat\sigma_I^2 = v_B(\hat\theta_I)$, $t_L = \mathrm{CDF}_t^{-1}(\alpha)$,
$t_U = \mathrm{CDF}_t^{-1}(1 - \alpha)$ and $\mathrm{CDF}_t(x) = \#\{t^*_b \le x;\ b = 1, \ldots, B\}/B$.


5. A REPEATED BRR 


We first describe the most common application of the
BRR, $n_h = 2$ clusters per stratum (McCarthy 1969), in the
setting of no missing data. A set of $B$ balanced half-samples
or replicates is formed by deleting one first-stage cluster
from the sample in each stratum, where this set is defined
by a $B \times H$ matrix $(\delta_{bh})_{B \times H}$ with $\delta_{bh} = +1$ or $-1$ according
to whether the first or the second first-stage cluster of
stratum $h$ is in the $b$-th half-sample and $\sum_{b=1}^{B}\delta_{bh}\delta_{bh'} = 0$ for
all $h \neq h'$; that is, the columns of the matrix are orthogonal.
A minimal set of $B$ balanced half-samples can be
constructed from a $B \times B$ Hadamard matrix by choosing
any $H$ columns excluding the column of all $+1$'s, where
$H + 1 \le B \le H + 4$. Let $\hat\theta_{(b)}$ be the survey estimator
computed from the $b$-th half-sample. The estimator $\hat\theta_{(b)}$ can
be obtained using the same formula as for $\hat\theta$ with $w_{hik}$
changed to $w_{hik(b)}$, which equals $2w_{hik}$ or 0 according to
whether or not the $(hi)$-th cluster is selected in the $b$-th
half-sample. The BRR variance estimator for $\hat\theta$ is
then given by

$$v_{BRR}(\hat\theta) = \frac{1}{B}\sum_{b=1}^{B}\big(\hat\theta_{(b)} - \hat\theta_{(\cdot)}\big)^2, \tag{9}$$

where $\hat\theta_{(\cdot)} = B^{-1}\sum_{b=1}^{B}\hat\theta_{(b)}$ and is often replaced by $\hat\theta$. The
variance estimator $v_{BRR}$ has been shown to be consistent
for smooth functions of estimated totals by Krewski and
Rao (1981) and for nonsmooth estimators by Shao and Wu
(1992) and Shao and Rao (1994).
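
For the case of no missing data, the construction can be sketched as follows. The sketch builds the Hadamard matrix by the Sylvester doubling construction, which only yields orders that are powers of two, so the number of replicates it produces may exceed the minimal $B \le H + 4$ mentioned above; that choice, the function names and the example totals are assumptions made for illustration.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of a power-of-two order (Sylvester construction)."""
    matrix = [[1]]
    while len(matrix) < order:
        matrix = ([row + row for row in matrix]
                  + [row + [-x for x in row] for row in matrix])
    return matrix


def balanced_half_samples(n_strata):
    """B x H matrix of +1/-1 entries with mutually orthogonal columns."""
    order = 1
    while order < n_strata + 1:
        order *= 2
    hadamard = sylvester_hadamard(order)
    # Drop column 0 (all +1) and keep the next n_strata columns.
    return [row[1:n_strata + 1] for row in hadamard]


def brr_variance(psu_totals, design):
    """psu_totals[h] = (z_h1, z_h2), the weighted totals of the two sampled
    psu's in stratum h; the full-sample estimate is the sum of all of them."""
    replicates = []
    for row in design:
        # The selected psu gets weight 2w, the other psu gets weight 0.
        replicates.append(sum(2 * (z1 if sign == 1 else z2)
                              for sign, (z1, z2) in zip(row, psu_totals)))
    centre = sum(replicates) / len(replicates)
    return sum((r - centre) ** 2 for r in replicates) / len(replicates)


design = balanced_half_samples(3)
variance = brr_variance([(1, 3), (2, 5), (4, 4)], design)
```

Because the columns are orthogonal, the cross-stratum terms cancel and the result is $\sum_h (z_{h1} - z_{h2})^2 = 4 + 9 + 0 = 13$, the usual variance estimator for a total with two psu's per stratum.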

A naive BRR for problems with randomly imputed data
would be obtained as in (9) with $\hat\theta_{(b)}$ and $\hat\theta_{(\cdot)}$ replaced by $\hat\theta_{I(b)}$
and $\hat\theta_{I(\cdot)} = B^{-1}\sum_{b=1}^{B}\hat\theta_{I(b)}$, where $\hat\theta_{I(b)}$ is the estimator
calculated from $Y_I$ using the BRR weights. But this produces
inconsistent variance estimators because it fails to take into
account the effect of missing data and the random
imputation.

To correctly apply the BRR in the presence of random
imputation by using re-imputation, we must deal with the
issue of $n_h$ being small. Recall that for the bootstrap such
small $n_h$'s caused difficulty because the stratum resample
size, $n_h - 1$, was smaller than the original stratum sample
size, $n_h$. This is true for the BRR as well. We propose an
easy way to circumvent this difficulty. Rather than
obtaining the $b$-th BRR replicate of the estimator, $\hat\theta_{(b)}$, from
the same formula as for $\hat\theta$ but with weights $w_{hik(b)}$ equal to
$2w_{hik}$ or 0 according to whether or not the $(hi)$-th cluster is
selected in the $b$-th half-sample, use the
original weights but include the $(hi)$-th cluster twice or not
at all according to whether or not the $(hi)$-th cluster is selected
in the $b$-th half-sample. If we view the BRR in this
way: i) the resulting $v_{BRR}$ in (9) remains the same; and ii)
the resample size is the same as the original sample size.
This repeated BRR can be viewed as a type of balanced
bootstrap. However, one should note that the balanced
bootstrap described in Nigam and Rao (1996) for the case
of no missing data does not work in this case because,
though it uses a resample size $n_h = 2$ in each stratum, it
does so in such a way as to still require rescaling and thus
will not work in the presence of random imputation.

The proposed repeated BRR has no difficulty in the
presence of random imputation. The procedure becomes

1. Form the set of half-samples, 1 unit per stratum, using
a Hadamard matrix as described above.

2. Obtain the $b$-th BRR replicate by repeating each unit in
the obtained half-sample twice. Denote this
$\{y^{(b)}_{hi}: i = 1, \ldots, n_h = 2\}$.

3. Let $a^{(b)}_{hij}$ be the response indicator associated with
$y^{(b)}_{hij}$, $s^{(b)}_m = \{(h,i,j): a^{(b)}_{hij} = 0\}$, and $s^{(b)}_r = \{(h,i,j): a^{(b)}_{hij} = 1\}$.
Apply the same imputation procedure used in
constructing $Y_I$ to the units in $s^{(b)}_m$, using the
"respondents" in $s^{(b)}_r$. Denote the $b$-th BRR replicate of $Y_I$
by $Y^*_{I(b)}$.

4. Obtain the BRR analogue $\hat\theta^*_{I(b)}$ of $\hat\theta_I$, based on the
imputed BRR data set $Y^*_{I(b)}$.

5. Repeat Steps 2-4 for each row of the $B \times H$ matrix to get
$\hat\theta^*_{I(b)}$ for $b = 1, \ldots, B$ and apply the standard BRR
formula (9) to obtain BRR variance estimators for $\hat\theta_I$,
with $\hat\theta_{(b)}$ replaced by $\hat\theta^*_{I(b)}$ and $\hat\theta_{(\cdot)}$ by
$\hat\theta^*_{I(\cdot)} = B^{-1}\sum_b \hat\theta^*_{I(b)}$. (For the same reason that is
discussed in section 4, we should not replace
$\hat\theta^*_{I(\cdot)}$ by $\hat\theta_I$.)
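
The five steps can be sketched for the estimated total as follows. The sketch assumes two fully enumerated clusters per stratum, a single imputation class with weighted hotdeck imputation, `None` as the marker for a nonrespondent, at least one respondent in every replicate, and a Sylvester-type Hadamard matrix; these choices and all names are illustrative rather than taken from the article.

```python
import random


def half_sample_design(n_strata):
    """Rows of +1/-1 choices, one column per stratum, orthogonal columns."""
    order = 1
    while order < n_strata + 1:
        order *= 2
    rows = [[1]]
    while len(rows) < order:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return [r[1:n_strata + 1] for r in rows]


def hot_deck_total(units, rng):
    """Re-impute missing y (None) by weighted hot deck, then total w * y."""
    donors = [(w, y) for w, y in units if y is not None]
    values = [y for _, y in donors]
    weights = [w for w, _ in donors]
    total = 0.0
    for w, y in units:
        if y is None:
            y = rng.choices(values, weights=weights, k=1)[0]
        total += w * y
    return total


def repeated_brr_variance(strata, rng):
    """strata[h] = (first_cluster, second_cluster); a cluster is a list of
    (weight, y_or_None) pairs carrying the ORIGINAL survey weights."""
    replicates = []
    for row in half_sample_design(len(strata)):
        units = []
        for sign, (first, second) in zip(row, strata):
            chosen = first if sign == 1 else second
            units.extend(chosen + chosen)    # selected cluster used twice
        replicates.append(hot_deck_total(units, rng))
    centre = sum(replicates) / len(replicates)   # replicate mean, not theta_I
    return sum((r - centre) ** 2 for r in replicates) / len(replicates)
```

With complete data the doubled clusters reproduce the doubled-weight replicates of the standard BRR, so the variance estimate is unchanged; with missing data each replicate is re-imputed from its own respondents.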

We can extend this idea to cases with $n_h > 2$ by using the
same strategy with half-samples obtained from balanced
orthogonal multi-arrays (BOMA's) (Sitter 1993). For
example, Table 1 gives a set of $B = 24$ balanced resamples
for $H = 7$ strata with $n_h = 4$ psu's in each stratum. It is
derived using the BOMA given in Table 1 of Sitter (1993)
and repeating each resampled unit twice as in Step 2 above.
Using a BOMA in Steps 1 and 2 of the procedure above
also results in an approximately unbiased variance
estimator. BOMA's are fairly easily constructed for even $n_h$
using balanced incomplete block designs and Hadamard
matrices, but are difficult to construct for odd $n_h$. They can
also handle unequal $n_h$'s for different strata, though
construction becomes a more serious problem (see Sitter 1993).


6. A SIMULATION 


To study the properties of the proposed resampling
variance estimators, we consider a finite population of $H = 32$
strata with $N_h$ clusters in stratum $h$ and ten ultimate units
in each cluster. The values of the characteristic of interest,
$y_{hik}$, are generated as follows:

$$y_{hik} = \mu_{hi} + \epsilon_{hik},$$

where $\mu_{hi} \sim N(\mu_h, \sigma_h^2)$ independent of $\epsilon_{hik} \sim
N(0, [1 - \rho]\sigma_h^2/\rho)$ and the parameter values are those
given in Table 2. For a particular value of the intracluster
correlation, $\rho$, a single finite population was thus generated
and then fixed and repeatedly sampled from. Each
simulation consisted of selecting $n_h = 2$ clusters with
replacement from stratum $h$ for $h = 1, \ldots, H$ and enumerating the
entire cluster. Each ultimate unit in the obtained cluster was
independently declared a respondent or nonrespondent with
probability $p$ and $(1 - p)$ respectively, i.e., uniform
response. The nonrespondents were then imputed both
using random imputation and adjusted random imputation
and the population total and distribution function, for
various values of $F(t)$, were estimated. Two values of
$\rho$, 0.1 and 0.3, and two values of $p$, 0.6 and 0.8, were
considered. Note that the first-stage sampling fraction is
quite small (0.064), so that with-replacement and
without-replacement sampling are essentially equivalent.

To compare the performance of the different variance
estimators we calculated the percent relative bias and
relative instability for each, defined as

$$\%RB = 100 \times \Big[\frac{1}{S}\sum_{s=1}^{S} v_s(\hat\theta_I)\Big/\mathrm{MSE}(\hat\theta_I) - 1\Big]$$

and

$$RI = \Big\{\frac{1}{S}\sum_{s=1}^{S}\big[v_s(\hat\theta_I) - \mathrm{MSE}(\hat\theta_I)\big]^2\Big\}^{1/2}\Big/\mathrm{MSE}(\hat\theta_I),$$

respectively, where the number of simulation runs was $S = 5{,}000$
and the true $\mathrm{MSE}(\hat\theta_I)$ was obtained through an
independent set of 50,000 simulation runs. The bootstrap
variance estimators were each based on $B = 2{,}000$ bootstrap
resamples. We obtain results for estimating the variance of $\hat\theta_I$
equal to the imputed total and the imputed distribution
function using: (i) the repeated half-sample bootstrap with
proper Monte Carlo approximation, $v_B$, as in equation (8)
and with improper Monte Carlo approximation replacing
$\bar\theta^*_I$ with $\hat\theta_I$, denoted $v_{B2}$; and (ii) the proper repeated
BRR, $v_{BRR}$, as in equation (9) and the improper repeated
BRR replacing $\hat\theta^*_{I(\cdot)}$ with $\hat\theta_I$, denoted $v_{BRR2}$.

Table 1
A Set of Balanced Resamples Constructed from a BOMA

                                         h
 b      1          2          3          4          5          6          7
 1  (1,1,3,3)  (1,1,3,3)  (1,1,3,3)  (1,1,3,3)  (1,1,3,3)  (1,1,3,3)  (1,1,3,3)
 2  (1,1,4,4)  (1,1,4,4)  (1,1,4,4)  (1,1,4,4)  (1,1,4,4)  (1,1,4,4)  (1,1,4,4)
 3  (1,1,2,2)  (1,1,2,2)  (1,1,2,2)  (1,1,2,2)  (1,1,2,2)  (1,1,2,2)  (1,1,2,2)
 4  (2,2,4,4)  (1,1,3,3)  (2,2,4,4)  (1,1,3,3)  (2,2,4,4)  (1,1,3,3)  (2,2,4,4)
 5  (2,2,3,3)  (1,1,4,4)  (2,2,3,3)  (1,1,4,4)  (2,2,3,3)  (1,1,4,4)  (2,2,3,3)
 6  (3,3,4,4)  (1,1,2,2)  (3,3,4,4)  (1,1,2,2)  (3,3,4,4)  (1,1,2,2)  (3,3,4,4)
 7  (2,2,4,4)  (2,2,4,4)  (1,1,3,3)  (1,1,3,3)  (2,2,4,4)  (2,2,4,4)  (1,1,3,3)
 8  (2,2,3,3)  (2,2,3,3)  (1,1,4,4)  (1,1,4,4)  (2,2,3,3)  (2,2,3,3)  (1,1,4,4)
 9  (3,3,4,4)  (3,3,4,4)  (1,1,2,2)  (1,1,2,2)  (3,3,4,4)  (3,3,4,4)  (1,1,2,2)
10  (1,1,3,3)  (2,2,4,4)  (2,2,4,4)  (1,1,3,3)  (1,1,3,3)  (2,2,4,4)  (2,2,4,4)
11  (1,1,4,4)  (2,2,3,3)  (2,2,3,3)  (1,1,4,4)  (1,1,4,4)  (2,2,3,3)  (2,2,3,3)
12  (1,1,2,2)  (3,3,4,4)  (3,3,4,4)  (1,1,2,2)  (1,1,2,2)  (3,3,4,4)  (3,3,4,4)
13  (1,1,3,3)  (1,1,3,3)  (1,1,3,3)  (2,2,4,4)  (2,2,4,4)  (2,2,4,4)  (2,2,4,4)
14  (1,1,4,4)  (1,1,4,4)  (1,1,4,4)  (2,2,3,3)  (2,2,3,3)  (2,2,3,3)  (2,2,3,3)
15  (1,1,2,2)  (1,1,2,2)  (1,1,2,2)  (3,3,4,4)  (3,3,4,4)  (3,3,4,4)  (3,3,4,4)
16  (2,2,4,4)  (1,1,3,3)  (2,2,4,4)  (2,2,4,4)  (1,1,3,3)  (2,2,4,4)  (1,1,3,3)
17  (2,2,3,3)  (1,1,4,4)  (2,2,3,3)  (2,2,3,3)  (1,1,4,4)  (2,2,3,3)  (1,1,4,4)
18  (3,3,4,4)  (1,1,2,2)  (3,3,4,4)  (3,3,4,4)  (1,1,2,2)  (3,3,4,4)  (1,1,2,2)
19  (2,2,4,4)  (2,2,4,4)  (1,1,3,3)  (2,2,4,4)  (1,1,3,3)  (1,1,3,3)  (2,2,4,4)
20  (2,2,3,3)  (2,2,3,3)  (1,1,4,4)  (2,2,3,3)  (1,1,4,4)  (1,1,4,4)  (2,2,3,3)
21  (3,3,4,4)  (3,3,4,4)  (1,1,2,2)  (3,3,4,4)  (1,1,2,2)  (1,1,2,2)  (3,3,4,4)
22  (1,1,3,3)  (2,2,4,4)  (2,2,4,4)  (2,2,4,4)  (2,2,4,4)  (1,1,3,3)  (1,1,3,3)
23  (1,1,4,4)  (2,2,3,3)  (2,2,3,3)  (2,2,3,3)  (2,2,3,3)  (1,1,4,4)  (1,1,4,4)
24  (1,1,2,2)  (3,3,4,4)  (3,3,4,4)  (3,3,4,4)  (3,3,4,4)  (1,1,2,2)  (1,1,2,2)


Table 3 summarizes the results for percent relative bias
using random imputation and adjusted random imputation.
Note that adjusted random imputation is not presented for
estimating the population total, $Y$, as adjusted random
imputation removes the imputation variance from the
estimator and thus simpler methods of variance estimation are
available (Chen et al. 2000). It is clear from the high %RB
for $v_{B2}$ and $v_{BRR2}$ that one must not replace $\bar\theta^*_I$ and $\hat\theta^*_{I(\cdot)}$
by $\hat\theta_I$ in the bootstrap or the BRR, respectively. It is also
clear that both the repeated half-sample bootstrap and the
repeated BRR variance estimators, $v_B$ and $v_{BRR}$, have
negligible bias when properly applied.
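
The two performance measures reported in Tables 3 and 4 can be computed from a list of simulated variance estimates and a reference mean squared error, as in the sketch below. It assumes the percent relative bias is the average estimate relative to the MSE, minus one, times 100; the function names are hypothetical.

```python
import math


def percent_relative_bias(variance_estimates, true_mse):
    """100 * (average variance estimate / true MSE - 1)."""
    mean_v = sum(variance_estimates) / len(variance_estimates)
    return 100.0 * (mean_v / true_mse - 1.0)


def relative_instability(variance_estimates, true_mse):
    """Root mean squared error of the variance estimates, relative to the MSE."""
    mean_sq = sum((v - true_mse) ** 2 for v in variance_estimates) / len(variance_estimates)
    return math.sqrt(mean_sq) / true_mse
```

For instance, estimates of 1.0 and 3.0 against a true MSE of 2.0 give a relative bias of 0 percent but a relative instability of 0.5, which is why both measures are reported.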
Table 2
Parameters of the Finite Population

E 2 ‘ 2 x
: 7 be sp " ; Oe a ‘
3 20 ISO US I Bil 130 = 13.0
Gt Ben 16.6 10 34 120 12.0
5 2 165 165 21 34 2110 110
6 25 190 19.0 22 34 100 10.0
7 55 ep ™ 186 5334 150° 15:0
SP g28> 70a 110 DAO) ae
9228.05 605 1.16.0 251d) 27 lal 004 110.0
IQ 5228) Mal 807 S1S:0 LONE -ASON MSO
1 pdt 2120 ea 7.0. 270 37) 125 12.5
12 31 160 16.0 28 39 100 10.0
LS hippie peas 15:0 An eee catee orate
Leis dehy ee nde DOM iments
15 31 FAO) SGA 31 42 1 eS :
frre Fea 6.6 32 42 08 T75

Given the results of Table 3, we consider relative
instability, RI, only for $v_B$ and $v_{BRR}$. We also restrict our
presentation to $\rho = 0.3$ and $p = 0.6$ as the RI results were
qualitatively the same in the other three cases. These results
are given in Table 4. As one can see, though the differences
are small, $v_B$ is slightly more stable than $v_{BRR}$. This was
generally the case for all values of $\rho$ and $p$. We also
included the adjusted jackknife of Rao and Shao (1992) and
the adjusted BRR of Shao et al. (1998) in simulations for
$\hat\theta = \hat Y$ and $v_B$ again was uniformly more stable. For
example, with $\rho = 0.3$ and $p = 0.6$ as in Table 4, RI for the
adjusted jackknife and the adjusted BRR were both 0.27.
This may be because the reimputation approach has an
advantage in estimating the component of the variance due
to the imputation against the adjustment approach, provided
the resample size is large enough to eliminate Monte Carlo
error as is the case in our simulations. But, when the
number of reimputations is moderate (like in the BRR with
reimputation or the bootstrap with B = 1,000), this
advantage is not entirely realized.


Table 3
% RB for v_B, v_B2, v_BRR and v_BRR2

                         Random imputation            Adjusted random imputation
Estimand           v_BRR  v_BRR2    v_B    v_B2      v_BRR  v_BRR2    v_B    v_B2

ρ = 0.1 and p = 0.6
Y                   0.00   21.54   0.79   21.60
F(t) = 0.0625      -1.09   15.92  -0.52   15.88       0.46   19.64   1.24   19.51
F(t) = 0.2500      -0.13   19.44   0.62   1ONSS       0.85   14.86   1.80   15.08
F(t) = 0.5000      -0.36   21.68   OS     DAES        0.55   10.73   1.24   10.76
F(t) = 0.7500      -0.84   19.89   0.13   20.09      -0.36   10.98   0.54   IESY
F(t) = 0.9375       0.05   21.92   0.57   21.66       0.81   19.12   1239   18.91

ρ = 0.1 and p = 0.8
Y                  -0.63   15.06   0.36   i5),3)7/
F(t) = 0.0625      -1.99   10.30  -1.72   10.16      -1.65   10.97  -1.08   WL}
F(t) = 0.2500      -1.27   13.65  -0.88   13.30      -0.95    8.89  -0.52    8.81
F(t) = 0.5000      -0.72   15.26   0.02   15.26      -0.12    6.58   O05     6.53
F(t) = 0.7500      -0.37   14.50   0.57   14.76       0.36    7.56   1.05    7.81
F(t) = 0.9375      -0.14   16.16   0.75   16.36       0.56   13.04   eae    13.08

ρ = 0.3 and p = 0.6
Y                   0.25   21.34   0.78   21.09
F(t) = 0.0625      -1.39   11.45  -0.86   LOS 7      -0.35   15.38   0.64   15.64
F(t) = 0.2500      -0.41   19.89   0.14   19573       1-23   13.79   1.71   13.62
F(t) = 0.5000      -0.10   20.25   0.37   19.89       0.29    8.97   0.78    8.88
F(t) = 0.7500      -1.40   16.70  -0.49   16.89      -0.75    9.24   0.07    9.49
F(t) = 0.9375       0.71   17.78   1.03   eS yl       0.91   15.07   1.34   15.04

ρ = 0.3 and p = 0.8
Y                   0.01   22      0.93   eos
F(t) = 0.0625      -1.09    7.54  -0.56    7.69      -1.24    8.64  -0.35    9.07
F(t) = 0.2500      -0.44   S42)   -0.08   14.99      -0.23    8.18   0.29    8.23
F(t) = 0.5000       0.05   14.92   0.71   14.84       0.43    6.21   0.86    6.20
F(t) = 0.7500       0.13   12.54   0.86   12.70       0.81    6.85   1.26    6.99
F(t) = 0.9375       1.62   Sal     2.06   13.01       1.86   11.04   2.34   11.02

Table 4
RI for v_B and v_BRR with ρ = 0.3 and p = 0.6

                    Random imputation     Adjusted random imputation
Estimand            v_BRR     v_B         v_BRR     v_B
Y                   0.27      0.23
F(t) = 0.0625       0.60      0.59        0.57      0.56
F(t) = 0.2500       0.35      0.32        Ory,      0.35
F(t) = 0.5000       0.27      0.23        0.28      0.26
F(t) = 0.7500       0.29      0.26        0.30      0.28
F(t) = 0.9375       0.48      0.46        0.48      0.46


7. CONCLUSION 


We proposed repeated half-sample bootstrap and
balanced repeated replication methods for variance
estimation in the presence of random imputation that capture the
imputation variance by reimputing for each replication
using the same random imputation method as in the original
sample. These repeated half-sample methods are valid in 
stratified multi-stage sampling, even when the number of 
psu’s sampled in each stratum is very small, e.g., 2. The key 
is that these methods use a stratum resample size that is 
equal to the original sample size without resorting to 
rescaling. These provide a unified method that works 
irrespective of the imputation method (random or non- 
random), the stratum size (small or large), the type of 
estimator (smooth or nonsmooth), or the type of problem 
(variance estimation or sampling distribution estimation). It 
is important to note that using reimputation to capture the 
imputation variance requires that one take greater care in 
the definition of the BRR and the Monte Carlo approxi- 
mation to the bootstrap variance. In both cases it is 
important to use the mean of the replicates in the definition 
as opposed to replacing it with the estimator applied to the 
original sample. 


Local Polynomial Regression in Complex Surveys


D.R. BELLHOUSE and J.E. STAFFORD


ABSTRACT 


Local polynomial regression methods are put forward to aid in exploratory data analysis for large-scale surveys. The 
proposed method relies on binning the data on the x-variable and calculating the appropriate survey estimates for the mean 
of the y-values at each bin. When binning on x has been carried out to the precision of the recorded data, the method is the 
same as applying the survey weights to the standard criterion for obtaining local polynomial regression estimates. The 
alternative of using classical polynomial regression is also considered and a criterion is proposed to decide whether the 
nonparametric approach to modeling should be preferred over the classical approach. Illustrative examples are given from 


the 1990 Ontario Health Survey. 


KEY WORDS: Covariates; Exploratory data analysis; Kernel smoothing; Regression. 


1. INTRODUCTION 


Following Fuller (1975), multiple linear regression tech- 
niques have been studied and used extensively in sample 
surveys. At least three chapters of Skinner, Holt and Smith 
(1989) are devoted to this subject. Here we restrict attention 
to the case in which there is one covariate x for the variate 
of interest y so that we could consider polynomial regres- 
sion as well as simple linear regression. In this context we 
could also consider the nonparametric approach of local 
polynomial regression, which, for the case of independent 
and identically distributed random variables, is described in 
Härdle (1990), Wand and Jones (1995), Fan and Gijbels
(1996), Simonoff (1996) and Eubank (1999). Using the 
survey weights, Korn and Graubard (1998) introduced the 
use of local polynomial regression for graphical display of 
complex survey data. However, they did not provide any 
statistical properties for their procedures. Smith and Njenga 
(1992) used regression kernel smoothing techniques to 
obtain robust estimates of the mean and regression para- 
meters for an assumed superpopulation model. Here we use 
local polynomial regression as an exploratory tool to dis- 
cover relationships between y and its covariate x. 

We assume that the covariate x is measured on a 
continuous scale. Due to the precision at which the data are 
recorded for the survey file and the size of the sample, there 
will be multiple observations at many of the distinct values. 
This feature of large-scale survey data has been exploited 
by Hartley and Rao (1968, 1969) in their scale-load 
approach to the estimation of finite population parameters. 
Here we exploit this same feature of the data to examine the 
relationship between y and its covariate x. In recognizing 
that the data may be naturally binned to the precision of the 
data, we can consider taking a further step by constructing 
larger bin sizes. Under this approach we examine the effect
of the sampling design on estimates and second order
moments.

Suppose that in the finite population of size N, x has k 
distinct values so that natural binning has taken place, or 
that x has been categorized into k bins that are wider than 
the precision of the data. Let $x_i$ be the value of x representing 
the $i$th bin, and assume that the values of $x_i$ are 
equally spaced. The spacing or bin size is $b = x_i - x_{i-1}$. The 
finite population mean for the y-values at $x_i$ is $\bar{y}_i$. We 
assume that a sample of size n taken from this population 
has the same structure as the population in that there are k 
bins. From the sample data we calculate the survey estimate 
$\hat{\bar{y}}_i$ of $\bar{y}_i$. The finite population proportion of the 
observations with value $x_i$ is denoted by $p_i$. This proportion 
is estimated by the survey estimate $\hat{p}_i$. We assume 
that $\hat{\bar{y}}_i$ and $\hat{p}_i$ are asymptotically unbiased, in the sense of 
Särndal, Swensson and Wretman (1992, pages 166-167), 
for $\bar{y}_i$ and $p_i$ respectively. The survey estimates $\hat{\bar{y}}_i$ for 
$i = 1, \ldots, k$ have variance-covariance matrix $V$. On considering 
the distinct values $x_i$ as domains, the estimated 
variance-covariance matrix $\hat{V}$ may be obtained easily 
through survey packages such as SUDAAN and STATA. 
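The binned quantities described above can be illustrated with a minimal sketch. It assumes unit-level arrays of covariate values, responses and survey weights; the function name `binned_estimates` and its arguments are hypothetical, chosen for illustration only. The sketch produces the estimated bin proportions and bin means, but not the design-based matrix $\hat{V}$, which still has to come from survey software that knows the strata and clusters.

```python
import numpy as np

def binned_estimates(x, y, w, bin_width, origin):
    """Weighted bin proportions and bin means from unit-level data.

    x, y : covariate and response for each sampled unit
    w    : survey weight for each sampled unit
    Assumes every x is >= origin. Empty bins are dropped.
    Returns (bin midpoints, p_hat, ybar_hat).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)

    # Index of the equally spaced bin that each unit falls into.
    idx = np.floor((x - origin) / bin_width).astype(int)
    k = idx.max() + 1

    w_sum = np.bincount(idx, weights=w, minlength=k)       # total weight per bin
    wy_sum = np.bincount(idx, weights=w * y, minlength=k)  # weighted y total per bin

    keep = w_sum > 0
    midpoints = origin + (np.arange(k) + 0.5) * bin_width
    p_hat = w_sum[keep] / w.sum()            # estimated proportion in each bin
    ybar_hat = wy_sum[keep] / w_sum[keep]    # estimated mean of y in each bin
    return midpoints[keep], p_hat, ybar_hat


# Toy data: four units falling into two bins of width 1 starting at 18.
mid, p_hat, ybar_hat = binned_estimates(
    x=[18.2, 18.7, 19.1, 19.9], y=[20.0, 24.0, 22.0, 26.0],
    w=[1.0, 3.0, 2.0, 2.0], bin_width=1.0, origin=18.0,
)
```

The estimated proportions sum to one over the nonempty bins, which matches the role the $\hat{p}_i$ play as weights in the local polynomial fit.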

There are several advantages to binning the data on the 
covariate x for exploratory data analysis: 


- For large surveys, a plot of $\hat{\bar{y}}_i$ against $x_i$ may be more 
informative and less cluttered than a plot of the raw 
data. 

- By appealing to a finite population central limit 
theorem on $\hat{\bar{y}}_i$ and imposing a superpopulation 
assumption on $\bar{y}_i$, a relatively simple model for $\hat{\bar{y}}_i$ may 
be assumed so that the analyst may easily focus on the 
central issue considered here, determination of the 
trend function in x. 

- Once $\hat{V}$ has been obtained, a wide variety of 
powerful exploratory data analyses can be easily 
carried out in languages such as S-Plus. Working with 
the raw data requires continued appeals to SUDAAN 
or STATA for the appropriate variance estimates. 

- By binning the data, an approach to regression analysis 
is obtained that provides a parallel to other nonparametric 
approaches to survey data analysis. For 
example, in the categorical data analysis developed initially 
by Rao and Scott (1981), in the logistic regression 
approach of Roberts, Rao and Kumar (1987) or in the 
generalized linear model approach of Bellhouse and 
Rao (2000), the tests and associated distributions are 
obtained through survey estimates of domain means or 
proportions. 


For the superpopulation, we assume that we have a 
model such that $E_m(\bar{y}_i) = m(x_i)$, where $E_m$ is the superpopulation 
expectation. We assume further that as we move 
to a continuum of values on x, then m(x) is a smooth 
function. The function m(x) is the ultimate function of 
interest for estimation. In section 2 we provide local poly- 
nomial regression methods to estimate m(x). These 
methods are applied to data from the 1990 Ontario Health 
Survey in section 3. In section 4, the question is asked: 
would the classical polynomial regression techniques have 
served equally as well in modeling m(x)? Some future 
directions for this work are given in section 5. Generally, 
we adopt the notation of Wand and Jones (1995) in 
discussing local polynomial regression here. 


2. BASIC METHODOLOGY 


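For local polynomial regression, the estimate $\hat{m}(x)$ of $m(x)$ 
at any value of x is obtained upon minimizing 

$$\sum_{i=1}^{k} \hat{p}_i \left\{ \hat{\bar{y}}_i - \beta_0 - \beta_1 (x_i - x) - \cdots - \beta_q (x_i - x)^q \right\}^2 K((x_i - x)/h)/h \qquad (1)$$

with respect to $\beta_0, \beta_1, \ldots, \beta_q$. The values that minimize (1) 
are denoted by $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_q$. Further, for the given value 
of x, $\hat{m}(x) = \hat{\beta}_0$. In (1), the kernel $K(t)$ is a symmetric 
function with $\int K(t)\,dt = 1$, $\int tK(t)\,dt = 0$, 
$0 < \int t^2 K(t)\,dt < \infty$ and 

$$R(K) = \int [K(t)]^2\,dt < \infty. \qquad (2)$$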


Also in (1), h is the window width of the kernel. In minimizing 
(1) to obtain local polynomial regression estimates, 
there are two possibilities for binning on x. The first is to 
bin to the precision of the recorded data so that $\hat{\bar{y}}_i$ is 
calculated at each distinct outcome of x. In other situations 
it may be practical to pursue a binning on x that is rougher 
than the accuracy of the data. 
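As a small illustration of the kernel conditions, the sketch below checks them numerically for the Epanechnikov kernel $K(t) = \tfrac{3}{4}(1 - t^2)$ on $[-1, 1]$. This kernel is chosen here purely as an example; the paper does not say which kernel was used. The helper `integrate` is a plain trapezoid rule written for this sketch.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def integrate(values, grid):
    """Trapezoid rule for function values tabulated on a grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

grid = np.linspace(-1.0, 1.0, 20001)
K = epanechnikov(grid)

mass = integrate(K, grid)                     # integral of K(t): should be 1
first_moment = integrate(grid * K, grid)      # integral of t K(t): should be 0
second_moment = integrate(grid**2 * K, grid)  # integral of t^2 K(t): 1/5, finite and positive
roughness = integrate(K**2, grid)             # R(K): 3/5
```

Any symmetric density with finite variance and finite $R(K)$ passes the same checks, so the choice of kernel is usually driven by convenience.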


In moving from the sample to the population we 
maintain the same window width h. This is in contrast to 
Breidt and Opsomer (2000) and Buskirk (1999) who 
assume a smoothing parameter $h_N$ for smoothing in the full 
finite population. In the context here, this would yield a 
function $m_N(x)$, the finite population smoothed version of 
the $\bar{y}_i$ with smoothing parameter $h_N$, as a finite population 
parameter of interest followed by $m(x)$, the hypothetical 
smooth function under the asymptotic assumptions. We 
have kept h constant in view of the way in which binning 
has been done; the bin structure is the same in the 
sample as in the population. The choice of the smoothing 
parameter h depends on the spacing of the x's and the 
variation in the data (Green and Silverman 1994, pages 
43-44). The spacing of the covariate is usually dominant in 
the determination of h. Since the spacing has been kept 
constant from sample to finite population with the spacing 
changing only when the asymptotic assumptions are 
applied, we keep $h_N = h$. 

Korn and Graubard (1998) provide a slightly different 
objective function to (1). They replace the sum over the 
bins in (1) by the sum over all sampled units and $\hat{p}_i$ in (1) 
by the sample weights. Korn and Graubard’s objective 
function reduces to (1) plus a term that involves the 
weighted sum of squares of deviations of sample observa- 
tions from the binned means where the weights are the 
sample weights scaled to sum to one. Consequently, the 
estimate of m(x) is the same in both cases. 
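The equivalence stated above can be checked numerically with a short sketch. It assumes a Gaussian kernel, simulated data and a helper named `local_poly_fit`. None of these come from the paper; they are illustrative choices. The same weighted least squares routine is run once on unit-level data with scaled sample weights and once on binned means with the estimated bin proportions as weights.

```python
import numpy as np

def gauss_kernel(t):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def local_poly_fit(x0, x, y, weights, h, q=1):
    """Intercept of a kernel-weighted polynomial fit of degree q centred at x0."""
    d = np.asarray(x, dtype=float) - x0
    X = np.vander(d, q + 1, increasing=True)      # columns 1, d, ..., d^q
    w = np.asarray(weights, dtype=float) * gauss_kernel(d / h) / h
    XtW = X.T * w                                 # X^T W without forming diag(w)
    beta = np.linalg.solve(XtW @ X, XtW @ np.asarray(y, dtype=float))
    return beta[0]

# Simulated unit-level data with x recorded at integer values (natural binning).
rng = np.random.default_rng(0)
x_raw = rng.integers(18, 31, size=400).astype(float)
y_raw = 22.0 + 0.1 * (x_raw - 18.0) + rng.normal(0.0, 3.0, size=400)
w_raw = rng.uniform(0.5, 2.0, size=400)
w_raw = w_raw / w_raw.sum()                       # weights scaled to sum to one

# Binned version: estimated proportion and weighted mean at each distinct x.
x_bins = np.unique(x_raw)
p_hat = np.array([w_raw[x_raw == v].sum() for v in x_bins])
ybar_hat = np.array([np.sum(w_raw[x_raw == v] * y_raw[x_raw == v])
                     for v in x_bins]) / p_hat

m_raw = local_poly_fit(24.0, x_raw, y_raw, w_raw, h=3.0)
m_binned = local_poly_fit(24.0, x_bins, ybar_hat, p_hat, h=3.0)
```

The two fits agree up to rounding because the normal equations depend on the unit-level data only through the bin totals of the weights and of the weighted responses.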

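The estimate $\hat{m}(x)$ and its first two moments can be 
expressed in matrix notation. The forms are exactly the 
same as those that appear, for example, in Wand and Jones 
(1995, chapter 5.3) whose notation we have adopted. Let 
the vector of finite population means at the distinct values 
of x be $\bar{y} = (\bar{y}_1, \ldots, \bar{y}_k)^T$ and let $\hat{\bar{y}}$ be its vector of survey 
estimates. Further, let 

$$X_x = \begin{pmatrix} 1 & (x_1 - x) & \cdots & (x_1 - x)^q \\ 1 & (x_2 - x) & \cdots & (x_2 - x)^q \\ \vdots & \vdots & & \vdots \\ 1 & (x_k - x) & \cdots & (x_k - x)^q \end{pmatrix}$$

and 

$$W_x = \mathrm{diag}\left( p_1 K((x_1 - x)/h),\; p_2 K((x_2 - x)/h),\; \ldots,\; p_k K((x_k - x)/h) \right).$$

The matrix $\hat{W}_x$ is $W_x$ with $p_i$ replaced by $\hat{p}_i$. Then 

$$\hat{m}(x) = e^T (X_x^T \hat{W}_x X_x)^{-1} X_x^T \hat{W}_x \hat{\bar{y}}, \qquad (3)$$

where e is the $(q+1) \times 1$ vector $(1, 0, 0, \ldots, 0)^T$. The approximate 
design-based expectation of $\hat{m}(x)$ is 

$$E_p(\hat{m}(x)) = e^T (X_x^T W_x X_x)^{-1} X_x^T W_x \bar{y}, \qquad (4)$$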
where $E_p$ denotes expectation with respect to the sampling 
design. We can also consider (4) as a smoothed estimate of 
$m(x)$ so that $\hat{m}(x)$ is also an estimate of $m(x)$. In the derivation 
of (4) we note that $E_p(\hat{\bar{y}}) = \bar{y}$ and $E_p(\hat{W}_x) = W_x$ for 
large sample size n. Further, in (3) we can write 
$\hat{W}_x = W_x + A$, where $A = \hat{W}_x - W_x$. We use the first two 
terms in the expansion $(I + B)^{-1} = I - B + B^2 - B^3 + \cdots$ as 
an approximation to complete the derivation. Using the 
same techniques, the approximate design-based variance is 
given by 

$$V_p(\hat{m}(x)) = e^T (X_x^T W_x X_x)^{-1} X_x^T W_x V W_x X_x (X_x^T W_x X_x)^{-1} e. \qquad (5)$$

The results in (4) and (5) were obtained ignoring higher 
order terms in 1/n. An estimate of the variance $\hat{V}_p(\hat{m}(x))$ 
is obtained on substituting the survey estimate $\hat{V}$ for $V$ and 
$\hat{W}_x$ for $W_x$ in (5). 
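The matrix forms (3) and (5) translate almost line for line into code. The following is a minimal sketch, assuming a Gaussian kernel and a user-supplied estimate `V_hat` of the covariance matrix of the binned means; the function name `local_poly_binned` is hypothetical. The factor $1/h$ in (1) cancels in (3) and (5), so it is omitted from the weights.

```python
import numpy as np

def gauss_kernel(t):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def local_poly_binned(x0, x_bins, ybar_hat, p_hat, V_hat, h, q=1):
    """Estimate m(x0) as in (3) and its variance as in (5) with estimates plugged in."""
    x_bins = np.asarray(x_bins, dtype=float)
    d = x_bins - x0
    X = np.vander(d, q + 1, increasing=True)            # X_x, of size k x (q+1)
    W = np.diag(np.asarray(p_hat) * gauss_kernel(d / h))  # W_x with p replaced by p_hat
    e = np.zeros(q + 1)
    e[0] = 1.0
    A = np.linalg.solve(X.T @ W @ X, X.T @ W)           # (X'WX)^{-1} X'W
    s = e @ A                                           # row of smoother weights, length k
    m_hat = float(s @ np.asarray(ybar_hat, dtype=float))  # equation (3)
    var_hat = float(s @ np.asarray(V_hat) @ s)          # equation (5), since W is symmetric
    return m_hat, var_hat


# Toy check: binned means that lie exactly on a line are reproduced by a local linear fit.
x_bins = np.arange(18.0, 66.0)
p_hat = np.full(x_bins.size, 1.0 / x_bins.size)
ybar_hat = 2.0 + 3.0 * x_bins
V_hat = 0.01 * np.eye(x_bins.size)
m_hat, var_hat = local_poly_binned(40.0, x_bins, ybar_hat, p_hat, V_hat, h=7.0, q=1)
```

Writing the estimator as a row vector `s` applied to the binned means makes the variance a simple quadratic form in $\hat{V}$, which is why no further appeal to survey software is needed once $\hat{V}$ is in hand.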


3. EXAMPLES FROM THE ONTARIO HEALTH 
SURVEY 


We illustrate local polynomial regression techniques 
with data from the Ontario Health Survey (Ontario Ministry 
of Health 1992). This survey was carried out in 1990 using 
a stratified two-stage cluster sample. The purpose was to 
measure the health status of the people of Ontario and to 
collect data relating to the risk factors of major causes of 
morbidity and mortality in Ontario. The survey was 
designed to be compatible with the Canada Health Survey 
carried out in 1978-79. A total sample size of 61,239 people 
was obtained from 43 public health units across Ontario. 
The public health unit was the basic stratum with an 
additional division of the health unit into rural and urban 
strata so that there were a total of 86 strata. The first stage 
units within a stratum were enumeration areas taken from 
the 1986 Census of Canada. An average of 46 enumeration 
areas was chosen within each stratum. Within an enume- 
ration area, dwellings were selected, approximately 15 from 
an urban enumeration area and 20 from a rural enumeration 
area. Information was collected on members of the house- 
hold within the dwelling. 


Several health characteristics were measured. We focus 
on one continuous variable from the survey, Body Mass 
Index (BMI). The BMI is a measure of weight status and is 
calculated from the weight in kilograms divided by the 
square of the height in meters. The index is not applicable 
to adolescents, adults over 65 years of age and pregnant or 
breastfeeding women. The measure varies between 7.0 and 
45.0. A value of the BMI less than 20.0 is often associated 
with health problems such as eating disorders. An index 
value above 27.0 is associated with health problems such as 
hypertension and coronary heart disease. Associated with 
the BMI is another measure, the Desired Body Mass Index 
(DBMI). The DBMI is the same measure as BMI with 
actual weight replaced by desired weight. A total of 44,457 
responses were obtained for the BMI and 41,939 for the 
DBMI. 

When there are only a few distinct outcomes of x, 
binning on x is done in a natural way. For example, in 
investigating the relationship between the body mass index 
(BMI) and age, the age of the respondent was reported only 
at integral values. The solid dots in Figure 1 are the survey 
domain estimates of the average BMI ($\hat{\bar{y}}_i$) for women at 
each of the ages 18 through 65 ($x_i$). The solid and dotted 
lines show the plot of $\hat{m}(x)$ against x using bandwidths h = 
7 and h = 14 respectively. It may be seen from Figure 1 that 
BMI increases approximately linearly with age until around 
age 50. The increase slows in the early 50s, peaks at age 55 
or so, and then begins to decrease. On plotting the trend 
lines only for BMI and the desired body mass index 
(DBMI) for females as shown in Figure 2, it may be seen 
that, on average, women desire to reduce their BMI at every 
age by approximately two units. 
Figure 1. Age trend in BMI for females 

Figure 2. Age trends for females 

In other situations it is practical to construct bins on x 
wider than the precision of the data. To investigate the 
relationship between what women desire for their weight 
(DBMI = $\hat{\bar{y}}_i$) and what women actually weigh (BMI = $x_i$), 
the x-values were grouped. Since the data were very sparse 
for values of BMI below 15 and above 42, these data were 
removed from consideration. The remaining groups were 
15.0 to 15.2, 15.3 to 15.4 and so on, with the value of $x_i$ 
chosen as the middle value in each group. The binning was 
done in this way for the purposes of illustration to obtain a 
wide range of equally spaced nonempty bins. For each 
group the survey estimate $\hat{\bar{y}}_i$ was calculated. The solid dots 
in Figure 3 show the survey estimates of women’s DBMI 
for each grouped value of their respective BMI. The scatter 
at either end of the line reflects the sampling variability due 
to low sample sizes. The plot shows a slight desire to gain 
weight when the BMI is at 15. This desire is reversed by the 
time the BMI reaches 20 and the gap between the desire 
(DBMI) and reality (BMI) widens as BMI increases. 


Figure 3. BMI trend in DBMI for females (vertical axis: Desired Body Mass Index for females; horizontal axis: BMI groups) 


4. PARAMETRIC VERSUS NONPARAMETRIC 
REGRESSION 


Local polynomial regression allows us to obtain non- 
parametrically a functional relation between y and x. How- 
ever, a parametric model may also be reasonable. For 
example, on examining Figure 1 showing the Body Mass 
Index against age, we might consider the parametric model 
that y has a quadratic relationship to x. We may also want 
to test in Figure 2 if the two lines are parallel, or equiv- 
alently that the difference between the Body Mass Index 
and the Desired Body Mass Index for females is constant 
over all ages. This would involve modeling the trend lines 
as second degree polynomials and testing for equality in the 
trend lines of the parameters associated with the quadratic 
term as well as the parameters associated with the linear 
term. In all cases, the question arises as to whether or not 
the data can be adequately modeled by a polynomial 
relationship between y and x. One method that we propose 
as an answer to this question is to calculate the confidence 
bands based on local polynomial regression. These bands 
can be thought of as providing a region of acceptable model 
representations. If an appropriate parametric regression line 
falls within the bands, then it provides a reasonable model 
description of the data. The $100(1 - \alpha)\%$ local polynomial 
regression bands are obtained by plotting 

$$\hat{m}(x) \pm z_{\alpha/2} \sqrt{\hat{V}_p(\hat{m}(x))} \qquad (6)$$

over a range of values of x, where $z_{\alpha/2}$ is the $100(1 - \alpha/2)$ 
percentile of the standard normal distribution, where $\hat{m}(x)$ 
is determined from (3) and where $\hat{V}_p(\hat{m}(x))$ is (5) with $V$ 
replaced by its sample estimate $\hat{V}$. 
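The band construction in (6), and the check of whether a parametric curve stays inside it, can be sketched as follows. The function names `confidence_band` and `falls_within` and the toy numbers are illustrative assumptions; the inputs are taken to be values of $\hat{m}(x)$ and $\hat{V}_p(\hat{m}(x))$ already computed on a grid of x.

```python
import numpy as np
from statistics import NormalDist

def confidence_band(m_hat, var_hat, alpha=0.05):
    """Pointwise band m_hat +/- z_{alpha/2} * sqrt(var_hat), as in (6)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # standard normal percentile
    m_hat = np.asarray(m_hat, dtype=float)
    half_width = z * np.sqrt(np.asarray(var_hat, dtype=float))
    return m_hat - half_width, m_hat + half_width

def falls_within(curve, lower, upper):
    """True if the parametric curve lies inside the band at every grid point."""
    curve = np.asarray(curve, dtype=float)
    return bool(np.all((curve >= lower) & (curve <= upper)))


# Toy values on a three-point grid (not survey results).
lower, upper = confidence_band([24.0, 25.0, 26.0], [0.04, 0.04, 0.04])
inside = falls_within([24.1, 25.2, 25.8], lower, upper)
outside = falls_within([24.1, 25.5, 25.8], lower, upper)
```

The bands are pointwise rather than simultaneous, so this check is best read as an informal screening device in the exploratory spirit of the text.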

The parametric regression line to be tested may be 
obtained in one of two ways depending upon what sample 
information is available. If the complete sample file with 
sampling weights is available, then the standard regression 
approach in, for example, SUDAAN may be used. If only 
the binned data are available, in particular the survey 
estimates $\hat{\bar{y}}_i$ with estimated variance-covariance matrix $\hat{V}$, 
then another approach is needed. 

For the second approach, assume that $m(x_i) = x_i^T \beta$, 
where $x_i^T = (1, x_i, x_i^2, \ldots, x_i^q)$ and where $\beta^T = 
(\beta_0, \beta_1, \ldots, \beta_q)$ is the vector of regression coefficients. For 
the finite population we assume that $\bar{y}_i = x_i^T \beta + \epsilon_i$, where 
the errors are deviations of the actual finite population means from the model. 
For simplicity, we assume that these errors have mean 0 and 
variance-covariance matrix $\sigma^2 I$. Since the data are given by 
the survey estimates $\hat{\bar{y}}_i$ with variance-covariance matrix $V$, 
the operative model is 

$$\hat{\bar{y}}_i = x_i^T \beta + \delta_i, \qquad (7)$$

where the $\delta_i$ have mean 0 and variance-covariance matrix $\Sigma = 
\sigma^2 I + V$. The usual weighted least squares estimate of $\beta$ is 

$$\hat{\beta} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} \hat{\bar{y}}, \qquad (8)$$

where the $i$th row of X is $x_i^T$, $i = 1, \ldots, k$. In terms of data 
analysis it is necessary to replace $\Sigma$ in (8) by its estimate $\hat{\Sigma}$. 
Now the survey estimate of $V$ is $\hat{V}$ so that it remains to find 
an estimate of $\sigma^2$. This may be obtained through rss = 
$(\hat{\bar{y}} - X\hat{\beta})^T (\hat{\bar{y}} - X\hat{\beta})$, the residual sum of squares, by one 
of two ways. 

The first method is to approximate the expected residual 
sum of squares under model (7) and solve directly for $\sigma^2$. 
Upon using the expansion $(I + B)^{-1} = I - B + B^2 - B^3 + \cdots$ 
we find 

$$E(\mathrm{rss}) = (k - q - 1)\sigma^2 + \mathrm{tr}(V) - \mathrm{tr}(X^T V X (X^T X)^{-1}). \qquad (9)$$


The estimate of $\sigma^2$ is obtained on setting rss equal to the 
right hand side of (9) with $V$ replaced by $\hat{V}$ and then 
solving for $\sigma^2$. This leads to an iterative approach to model 
fitting. An initial estimate of $\beta$ is obtained from (8) with $\Sigma$ 
replaced by the survey estimate $\hat{V}$. Then $\sigma^2$ is estimated 
through (9) and a new estimate of $\beta$ using $\hat{\Sigma} = \hat{\sigma}^2 I + \hat{V}$ is 
obtained. The process is repeated until convergence is 
obtained in the estimate of $\sigma^2$. If the estimate of $\sigma^2$ is 
negative, it is set to 0. The second method for estimating $\sigma^2$ 
is obtained by first treating the errors in (7) as multivariate 
normal variables. Then a profile likelihood for $\sigma^2$ can be 
obtained on replacing $\beta$ and $V$ by their estimates. The most 
influential term in this profile likelihood is 

$$r^T (\sigma^2 I + \hat{V})^{-1} r, \qquad (10)$$

where $r = \hat{\bar{y}} - X (X^T (\sigma^2 I + \hat{V})^{-1} X)^{-1} X^T (\sigma^2 I + \hat{V})^{-1} \hat{\bar{y}}$ is 
the vector of residuals. An approximation to the profile 
likelihood estimate $\hat{\sigma}^2$ is that value of $\sigma^2$ which minimizes 
(10). 
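The first, iterative method can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function name `fit_polynomial_binned`, the tolerance and the iteration cap are hypothetical, $\hat{V}$ is assumed positive definite, and the multiplier of $\sigma^2$ in the expected residual sum of squares is taken to be the number of bins minus $q + 1$.

```python
import numpy as np

def fit_polynomial_binned(x_bins, ybar_hat, V_hat, q=2, tol=1e-10, max_iter=100):
    """Alternate between the weighted least squares fit (8) and the moment equation (9)."""
    x_bins = np.asarray(x_bins, dtype=float)
    ybar_hat = np.asarray(ybar_hat, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    k = x_bins.size
    X = np.vander(x_bins, q + 1, increasing=True)   # rows (1, x_i, ..., x_i^q)

    # Part of the expected rss that does not involve sigma^2.
    correction = np.trace(V_hat) - np.trace(X.T @ V_hat @ X @ np.linalg.inv(X.T @ X))

    sigma2 = 0.0                                    # first pass uses Sigma = V_hat
    for _ in range(max_iter):
        Sigma = sigma2 * np.eye(k) + V_hat
        Si_X = np.linalg.solve(Sigma, X)            # Sigma^{-1} X
        beta = np.linalg.solve(X.T @ Si_X, Si_X.T @ ybar_hat)   # equation (8)
        resid = ybar_hat - X @ beta
        rss = float(resid @ resid)
        # Solve rss = (k - q - 1) sigma^2 + correction; negative values are set to 0.
        new_sigma2 = max((rss - correction) / (k - q - 1), 0.0)
        converged = abs(new_sigma2 - sigma2) < tol
        sigma2 = new_sigma2
        if converged:
            break
    return beta, sigma2


# Toy check: binned means lying exactly on a quadratic give sigma^2 = 0.
x_bins = np.linspace(-2.0, 2.0, 9)
ybar_hat = 1.0 + 2.0 * x_bins + 0.5 * x_bins**2
beta, sigma2 = fit_polynomial_binned(x_bins, ybar_hat, 0.01 * np.eye(9), q=2)
```

Centring or rescaling the covariate before building the polynomial design matrix helps keep the linear systems well conditioned when x takes values such as ages between 18 and 65.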

To provide examples of the question of the adequacy of 
parametric regression, we examined two different variables 
in the Ontario Health Survey and their relationship to the 
body mass index (BMI). These were age and fat consump- 
tion as a percentage of total energy consumption. For age 
the binning was natural and at the precision of the recorded 
data. Age was restricted to the range of 18 to 65 years since 
the index is not applicable outside this range and age was 
recorded in years. The scatterplot of BMI against age with 
the accompanying local polynomial regression line is 
shown in Figure 1. The survey data on fat consumption in 
percentages were recorded to three decimal places. Due to 
the sparseness of the data at the extremes we looked at fat 
consumption in the range of 14 to 56% of total energy 
consumption. Further, we binned the data on the covariate 
(fat consumption) using bins 14.0 up to 14.2, 14.2 up to 
14.4 and so on; the midpoints of the bins (14.1, 14.3 and so 
on) were used as the $x_i$. At each bin the survey estimate $\hat{\bar{y}}_i$ 
for BMI was calculated. It is the binned data that appear as 
a scatterplot of BMI against fat consumption in Figure 5. 
The solid line in Figure 5 is the local polynomial regression 
line with q = 1 for BMI on fat content. As in Figure 3, the 
larger variability at the extremes reflects greater sampling 
variability due to smaller sample sizes at the extremes. 
From Figure 5 it appears that BMI increases slightly as fat 
consumption increases. Since the complete data file for the 
survey was available, regression lines for all variables were 
obtained through SUDAAN. 

In Figure 4 the solid lines are the 95% confidence bands 
based on (6) and the dashed line is the parametric second 
degree polynomial regression line. Since the dashed line 
falls near the border for women in their thirties and outside 
the bands for women in their early sixties, a second degree 
polynomial barely adequately describes the relation 
between BMI and age. Another model might be preferable. 
Figure 6 shows the same 95% confidence bands but for the 
consumption of fat as a percentage of total energy 
consumption. In this case the dotted line is the simple linear 
regression line of BMI on fat consumption. For fat 
consumption the line falls completely within the confidence 
bands so that simple linear regression appears to be an 
adequate description of the model relationship. 
Figure 4. Confidence Bands for the Age Trend in BMI for Females 

If the data have been binned to the precision of the data 
as in the case of age above, and if the exploratory analysis 
is complete, we can stop. The estimates and variance 
estimates obtained are equal to the estimates and variance 
estimates obtained from the raw data. This may be seen on 
examining (3). The term on the right hand side of (3) can be 
expressed as a sum over the sample of the sample weights 
times a new measurement obtained from the raw 
y-measurement times an appropriate value taken from 
$e^T (X_x^T \hat{W}_x X_x)^{-1} X_x^T \tilde{W}_x$ times the total of the sample 
weights, where $\tilde{W}_x$ is $\hat{W}_x$ with the $\hat{p}_i$'s removed. These 
adjusted y-measurements may be fed into SUDAAN or 
STATA to obtain the required approximate variance estimate. 
It may be that the binning has been rougher than the 
precision of the data or that some bins have been dropped 
in the tails of the distribution of x due to sparseness of the 
data in those bins. Both of these situations occurred in 
analyzing the relationship of BMI to fat consumption. Once 
the exploratory analysis has been completed we can return 
with a final model and smoothing parameter, if a nonparametric 
approach is used in the final analysis, and apply the 
model to the raw data, obtaining variance estimates through 
SUDAAN or STATA as necessary. Depending on the 
amount of roughness in the binning and the number of bins 
dropped due to sparseness in the data, the variance estimates 
obtained from the raw data will be approximately the 
same as those from the binned data. 


5. FUTURE DIRECTIONS 


Like Bellhouse and Stafford (1999), this paper adapts a 
modern method of smoothing for the analysis of complex 
survey data. It represents an example of a host of regression 
techniques that could be used. To describe these we embed 
the current context in a general framework hinting at future 
work. In doing so we mimic the developments of Hastie and 
Tibshirani (1990). 

Here a smoother is said to be linear if fitted values are 
obtained by applying a matrix S to a response vector y. As 
in the case of simple linear regression for independent and 
identically distributed data, we let $H = 
(X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1}$ and further denote $(X_x^T \hat{W}_x X_x)^{-1} 
X_x^T \hat{W}_x$ as $S_x$. Both are examples of S. In addition, the 
response vector of binned means is a type of smooth 
$\hat{\bar{y}} = S_b y$, where y is the vector of all sample responses and 
where $S_b$ involves the sample weights. Also the usual 
regression context involves applying a matrix similar to H 
to the full response vector, $\hat{y} = H_r y$. So moving from usual 
regression to regressing means to local polynomial smoothing 
reduces to applying different smoothing matrices to y: 

$$H_r y \rightarrow H S_b y \rightarrow S_x S_b y.$$

In general $S_x$ can be replaced by any smoother S and the 
methods extended to multiple covariates. 

There are many advantages to binning the response from 
both a theoretical and practical standpoint. Standard 
smoothing tools, like those found in S-Plus, can be applied 
without modification of the smoother due to sampling 
issues. In addition, in the case of the additive model, finite 
population central limit theorems can be invoked and issues 
like degrees of freedom, choice of smoothing parameter, 
optimizing a criterion, can be handled in the usual manner. 
In the case of multiple covariates $x_1, \ldots, x_p$, the curse of 
dimensionality will result in sparse bins not allowing the 
use of the central limit theorem. This may be countered in 
the usual way by binning partial residuals one dimension at 
a time. Here smoothers $S_{x_1}, S_{x_2}, \ldots, S_{x_p}$ would be used in 
a backfitting algorithm. 

It is our intention to study additive and generalized 
additive models in the above manner and to introduce these 
techniques to the analysis of complex survey data. 
Modelling Compositional Time Series from Repeated Surveys 


D.B.N. SILVA and T.M.F. SMITH 


ABSTRACT 


A compositional time series is defined as a multivariate time series in which each of the series has values bounded between 
zero and one and the sum of the series equals one at each time point. Data with such characteristics are observed in repeated 
surveys when a survey variable has a multinomial response but interest lies in the proportion of units classified in each of 
its categories. In this case, the survey estimates are proportions of a whole subject to a unity-sum constraint. In this paper 
we employ a state space approach for modelling compositional time series from repeated surveys taking into account the 
sampling errors. The additive logistic transformation is used in order to guarantee predictions and signal estimates bounded 
between zero and one which satisfy the unity-sum constraint. The method is applied to compositional data from the 
Brazilian Labour Force Survey. Estimates of the vector of proportions and the unemployment rate are obtained. In addition, 
the structural components of the signal vector, such as the seasonals and the trends, are produced. 


KEY WORDS: Additive logistic transformation; Compositional time series; Kalman Filter; Labour force survey; Repeated 
surveys; State space models. 


1. INTRODUCTION 


All surveys are multivariate and multipurpose, and most 
are longitudinal, repeating the same questions over time. 
There are two broad classes of repeated surveys, those with 
overlapping first stage units and those with no overlap of 
first stage units. Both designs admit a longitudinal macro- 
analysis of population aggregates but only the former 
allows a micro-analysis and the estimation of gross flows or 
some other similar unit level dynamic process. In this paper 
we explore the time series analysis of a multivariate vector 
of population aggregates, a macro-analysis, while taking 
into account the influence of the sampling errors of the 
survey using pe oe data. 

Denote by $\theta_t = (\theta_{1t}, \ldots, \theta_{M+1,t})'$ a vector of population 
quantities of interest at time t, and assume that observations 
are made at equally spaced time intervals t = 1, 2, ..., T. Let 
$y_t = (y_{1t}, \ldots, y_{M+1,t})'$ represent a survey-based estimate of $\theta_t$ 
based on data collected at time t. Repeated surveys produce 
time series $\{y_t\}$ comprising estimates of the unknown 
target series $\{\theta_t\}$. Focussing on the unknown population 
vector $\theta_t$, it is natural to imagine that knowledge of 
$\theta_1, \ldots, \theta_{t-1}$ conveys useful information about $\theta_t$ but 
without implying that it is perfectly predictable from 
$\theta_1, \ldots, \theta_{t-1}$. One way of representing this situation is by 
considering $\theta_t$ to be a random variable which evolves 
stochastically in time following a certain time series model, 
as first proposed for univariate survey analysis by Blight 
and Scott (1973), Scott and Smith (1974) and Scott, Smith 
and Jones (1977). The survey estimates $y_t$ of $\theta_t$ can then 
be written as: 

$$y_t = \theta_t + e_t \qquad (1)$$

where $\{\theta_t\}$, $\{y_t\}$ and $\{e_t\}$ are random processes and 
$e_t = (e_{1t}, \ldots, e_{M+1,t})'$ are the sampling errors such that 
$E(e_t \mid \theta_t) = 0$ and $V(e_t \mid \theta_t) = \Sigma_t$. 


The early work of Scott et al. (1977) was concerned with 
univariate $\{y_t\}$ and distinguished different forms for the 
data available on $\{e_t\}$. If the only data available to the 
analyst are the population aggregate estimates $\{y_t\}$ then 
this is termed a secondary analysis and the examples in 
Scott et al. (1977) are based on a secondary analysis of 
survey data. If the individual data records are available, 
then variances and covariances can be estimated directly 
from the data and this is called a primary analysis. In 
addition, in the case of a rotating panel survey, elementary 
estimates (based on data from a set of units that join and 
leave the survey at the same time) can be used to estimate 
the covariance structure of the sampling errors. Subsequent 
work by Jones (1980) used a primary analysis to measure 
the structure of the sampling noise whereas Binder and 
Hidiroglou (1988), Binder and Dick (1989), Pfeffermann, 
Burck and Ben-Tuvia (1989), Pfeffermann and Burck 
(1990), Pfeffermann (1991), Binder, Bleuer and Dick 
(1993), Pfeffermann and Bleuer (1993), Pfeffermann, Bell 
and Signorelli (1996), Pfeffermann, Feder and Signorelli 
(1998) and Harvey and Chung (2000) employed an 
elementary analysis. 

The time series analysis of survey data also requires that 
the signal process be modelled. In the early works it was 
assumed that $\{\theta_t\}$ was a stationary process and that $\{y_t\}$ 
was the superposition of two stationary processes therefore 
being itself stationary. Typically ARMA processes were 
assumed for $\{\theta_t\}$ and $\{e_t\}$, and hence for $\{y_t\}$. Binder 
and Hidiroglou (1988) wrote the processes in state space 
form which led rapidly to the introduction of nonstationary 
processes for the signal $\{\theta_t\}$, and structural models 
involving trends and seasonals have been used since then. 

The aim is to improve estimation of the unobservable 
signal and its components, but when the sampling errors are 
autocorrelated these autocorrelations can induce spurious 
trends which get confounded with the true signal trend, as 
pointed out by Tiller (1992) and Pfeffermann, Bell and 
Signorelli (1996). When the variation in the sampling errors 
is not taken into account, their autocorrelation structure may 
be absorbed into either the seasonal or the trend compo- 
nents, thus affecting the inference from the model. 

A special case of interest in repeated surveys is when the 
univariate target parameter $\{\theta_t\}$ is a proportion such as the 
unemployment rate. Unrestricted time series modelling of $\{\theta_t\}$ 
may lead to estimates outside the range $0 < \theta_t < 1$. Wallis 
(1987) used a logistic transformation to ensure that the 
estimates were bounded; however, he did not take into 
account the survey error. Pfeffermann (1991), Tiller (1992), 
Pfeffermann and Bleuer (1993), Pfeffermann, Bell and 
Signorelli (1996) fitted state space models to unemploy- 
ment rate series taking into account survey errors but with- 
out using the logistic transformation to guarantee bounded 
estimates. 

Most surveys are multivariate and there has been little 
work in the multivariate time series analysis of survey data. 
Brunsdon (1987) and Brunsdon and Smith (1998) analyse 
multivariate data from opinion polls taking into account the 
fact that the proportions are bounded and comprise a com- 
position, but not allowing for the structure of the survey 
errors. This work provides useful insight into the modelling 
of time series of proportions. Compositional data have also 
been modelled using a state space approach, by Quintana 
and West (1988), Shephard and Harvey (1989) and Singh 
and Roberts (1992), but these authors also did not address 
the issue of modelling the autocovariance structure of the 
sampling errors when the observed compositions are 
obtained from repeated surveys. 

The motivation for this work is that many variables 
investigated by statistical agencies have a multinomial 
response and interest lies in the estimation of the propor- 
tion of units classified in each of the categories. If this is the 
case, the vector of proportions sums to one and forms what 
is known as a composition. A compositional time series is 
therefore a multivariate time series comprising observations 
of compositions at each time point. We propose a class of 
multivariate state space models for compositional time 
series from repeated surveys, which takes into account the 
sampling errors and guarantees estimates satisfying the un- 
derlying constraints imposed by compositions. The proce- 
dure employs a signal-plus-noise structural model which 
yields seasonally adjusted series and estimates of the trend 
which satisfy the underlying sum constraint. The method is 
applied to compositional data from the Brazilian Labour 
Force Survey comprising estimates of the vector of proportions 
of labour market status. Estimates of seasonally 
adjusted compositions, trends and unemployment rate series 
are produced. 


2. A FRAMEWORK FOR MODELLING 
COMPOSITIONAL DATA FROM 
OVERLAPPING SURVEYS 


We assume that $\{\theta_t\}$ is multivariate and the components 
$(\theta_{mt})$ form a composition, i.e., $0 < \theta_{mt} < 1\ \forall\, m, t$ and 
$\sum_{m=1}^{M+1} \theta_{mt} = 1$. In this case $y_t$ is a vector of sample estimates, 
based on the cross-sectional data of time t, and 
belongs to the Simplex: 

$$S^M = \left\{ y_t : 0 < y_{mt} < 1,\ m = 1, \ldots, M+1;\ \sum_{m=1}^{M+1} y_{mt} = 1 \right\}$$

as in Brunsdon and Smith (1998). In addition, it is assumed 
that $y_t$ is obtained from a survey with complex design and 
overlapping units between occasions. Since each of its 
components is subject to sampling errors, $y_{mt}$ can be 
decomposed as: 

$$y_{mt} = \theta_{mt} + e_{mt} \qquad (2)$$

where $\theta_{mt}$ is the unknown population proportion assumed 
to follow a time series model, and $e_{mt}$ is the sampling error. 
Considering the M + 1 series simultaneously, (2) can be 
written in vector form as in equation (1). In addition, it is 
assumed that 

$$\sum_{m=1}^{M+1} \theta_{mt} = \sum_{m=1}^{M+1} y_{mt} = 1 \qquad (3)$$

which implies that $\sum_{m=1}^{M+1} e_{mt} = 0\ \forall\, t$. 


A compositional time series is a sequence of vectors
$y_t = (y_{1t}, \ldots, y_{M+1,t})'$, each belonging to $S^M$. Aitchison
(1986) examined the difficulties of applying standard
methods to modelling and analysing compositions and
suggested the use of transformations to map compositions
from the Simplex $S^M$ onto $R^M$. One such transformation is
the additive logratio transformation ($a_M$), defined in
Aitchison (1986, page 113), which was first adopted in a
time series context by Brunsdon (1987, page 75). The
transformation is given by $v_t = a_M(y_t) = (v_{1t}, \ldots, v_{Mt})'$
with

$$v_{mt} = \log\left(\frac{y_{mt}}{y_{M+1,t}}\right), \quad m = 1, \ldots, M; \; t = 1, \ldots, T, \qquad (4)$$


where log denotes the natural logarithm. Note that
$y_{M+1,t} = 1 - \sum_{m=1}^{M} y_{mt}$, sometimes called the fill-up value,
is used as the reference variable or category. The inverse
transformation, known as the additive logistic transformation,
is given by $y_t = a_M^{-1}(v_t) = (y_{1t}, \ldots, y_{M+1,t})'$ such that

$$y_{mt} = \frac{\exp(v_{mt})}{1 + \sum_{j=1}^{M} \exp(v_{jt})}, \quad m = 1, \ldots, M, \qquad y_{M+1,t} = \frac{1}{1 + \sum_{j=1}^{M} \exp(v_{jt})}. \qquad (5)$$
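As an illustration of equations (4) and (5), the following minimal Python sketch implements the additive logratio transformation and its inverse, taking the last component as the reference category. The function names and the numerical composition are hypothetical and are not part of the original study, whose programs were written in SAS-IML.

```python
import numpy as np

def alr(y):
    """Additive logratio transform, equation (4); the last component is the reference."""
    y = np.asarray(y, dtype=float)
    return np.log(y[..., :-1] / y[..., -1:])

def alr_inverse(v):
    """Additive logistic transform, equation (5); maps R^M back onto the simplex."""
    v = np.asarray(v, dtype=float)
    expv = np.exp(v)
    denom = 1.0 + expv.sum(axis=-1, keepdims=True)
    return np.concatenate([expv / denom, 1.0 / denom], axis=-1)

# Hypothetical 3-part composition (M = 2), used only to show the round trip.
y = np.array([0.05, 0.55, 0.40])
v = alr(y)                 # two unconstrained logratios
y_back = alr_inverse(v)    # back on the simplex: positive parts that sum to one
```

Any vector in $R^M$ is mapped by `alr_inverse` to a composition with positive parts summing to one, which is why estimates produced in the transformed space automatically respect the constraints.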


The state space modelling procedure for compositional
time series is invariant to the choice of the reference
variable (Silva 1996), and so any element $y_{mt} \neq y_{M+1,t}$ of $y_t$
can be taken as the reference variable when applying the
additive logistic transformation to the vector of survey
estimates. When the logratios $v_t$ are normally distributed, the
$(M+1)$-part composition has an additive logistic normal
distribution as defined in Aitchison and Shen (1980). For
compositional time series, Brunsdon (1987) recommended 
the use of Vector ARMA models (Tiao and Box 1981) for 
the transformed series. 

We propose a procedure that not only provides pre- 
dictions and filtered estimates that are bounded between 
zero and one and satisfy the unity-sum constraint, but also 
improves the estimation of the unobservable signal and its 
components, taking into account the sampling error. 

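Following Bell and Hillmer (1990), the model in (2) can
be rewritten as:

$$y_{mt} = \theta_{mt} + e_{mt} = \theta_{mt}\left(1 + \frac{e_{mt}}{\theta_{mt}}\right) = \theta_{mt} u_{mt}, \qquad (6)$$

with

$$u_{mt} = 1 + \tilde{u}_{mt} = \frac{y_{mt}}{\theta_{mt}}, \quad m = 1, \ldots, M+1, \qquad (7)$$

where $\tilde{u}_{mt} = e_{mt}/\theta_{mt}$ represents the relative sampling error
of the estimated proportion.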

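Applying the additive logratio transformation defined in
Aitchison (1986, page 113) to the vector $y_t$, with
components given in (2), produces a transformed vector
$v_t = a_M(y_t) = (v_{1t}, \ldots, v_{Mt})'$ contained in $R^M$. If $y_{M+1,t}$ is
used as the reference variable, the transformed vector has as
its $m$th component:

$$v_{mt} = \log\left(\frac{y_{mt}}{y_{M+1,t}}\right) = \log\left(\frac{\theta_{mt} u_{mt}}{\theta_{M+1,t} u_{M+1,t}}\right) = \log\left(\frac{\theta_{mt}}{\theta_{M+1,t}}\right) + \log\left(\frac{u_{mt}}{u_{M+1,t}}\right), \quad m = 1, \ldots, M. \qquad (8)$$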


From (8), a vector model for the transformed series can
be written as:

$$v_t = \theta^*_t + e^*_t, \qquad (9)$$

with $v_t = (v_{1t}, \ldots, v_{Mt})'$, $\theta^*_t = (\theta^*_{1t}, \ldots, \theta^*_{Mt})'$ and
$e^*_t = (e^*_{1t}, \ldots, e^*_{Mt})'$, where $v_{mt} = \log(y_{mt}/y_{M+1,t})$,
$\theta^*_{mt} = \log(\theta_{mt}/\theta_{M+1,t})$ and $e^*_{mt} = \log(u_{mt}/u_{M+1,t})$, for
$m = 1, \ldots, M$. Note that model (9) has the same form as
model (1).
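The additive decomposition of the transformed estimates into a transformed signal and a transformed error can be checked numerically. The short Python sketch below uses hypothetical proportions and sampling errors (chosen so that the errors sum to zero, as the unit-sum constraint requires); none of these numbers come from the survey.

```python
import numpy as np

# Hypothetical values for a 3-part composition; not taken from the paper.
theta = np.array([0.06, 0.54, 0.40])     # population proportions, sum to one
e = np.array([0.004, -0.010, 0.006])     # sampling errors, sum to zero
y = theta + e                            # survey estimates, still a composition
u = y / theta                            # multiplicative errors, u = 1 + e / theta

v = np.log(y[:-1] / y[-1])               # transformed survey estimates
theta_star = np.log(theta[:-1] / theta[-1])   # transformed signal
e_star = np.log(u[:-1] / u[-1])          # transformed sampling error
```

Here `v` equals `theta_star + e_star` up to rounding, so the additive signal-plus-noise form is exact in the transformed space, although `e_star` is a nonlinear function of the original errors.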

To describe the survey data, model (9) must incorporate
time series models for both $\{\theta^*_t\}$ and $\{e^*_t\}$. Hence a
multivariate model for the transformed data will depend on the
form of the time series models for $\{\theta^*_t\}$ and $\{e^*_t\}$.

The state space formulation for compositional data is 
examined in section 3, the model estimation is considered 
in section 4 and is illustrated using Brazilian Labour Force 
Survey data in section 5. 


3. MODELLING THE TRANSFORMED SERIES 


Our approach is based on assuming that the transformed
series $v_t = a_M(y_t)$ has the signal plus noise structure in
equation 9. We propose structural time series models for
$\{\theta^*_t\}$, as in Harvey (1989), and vector ARMA models
(Tiao and Box 1981) for $\{e^*_t\}$.

The transformed signal process $\{\theta^*_t\}$ is assumed to
follow the multivariate basic structural model, with each of
the components $\{\theta^*_{mt}\}$ following a basic structural time
series model (BSM) with possibly different parameters
across the series. The cross-sectional relationship between
the series is accounted for by the correlation structure of the
system disturbances. The model for $\{\theta^*_{mt}\}$, $m = 1, 2, \ldots, M$,
is then given by:

$$\begin{aligned}
\theta^*_{mt} &= L^*_{mt} + S^*_{mt} + I^*_{mt}, \\
L^*_{mt} &= L^*_{m,t-1} + R^*_{m,t-1} + \eta^{(l)}_{mt}, \\
R^*_{mt} &= R^*_{m,t-1} + \eta^{(r)}_{mt}, \\
S^*_{mt} &= -\sum_{j=1}^{11} S^*_{m,t-j} + \eta^{(s)}_{mt},
\end{aligned} \qquad (10)$$

where $L^*_{mt}$ is the trend/level component of the signal $\theta^*_{mt}$,
$R^*_{mt}$ is the corresponding change in the level, $S^*_{mt}$ is the seasonal
component and $I^*_{mt}$ is an irregular component. For each
component, the disturbances $\eta^{(l)}_{mt}$, $\eta^{(r)}_{mt}$, $\eta^{(s)}_{mt}$ and the
irregular $I^*_{mt}$ are assumed to be mutually uncorrelated
normal deviates with mean zero and variances
$\sigma^2_{lm}$, $\sigma^2_{rm}$, $\sigma^2_{sm}$, $\sigma^2_{Im}$, respectively. That is, the $M \times 1$ vector
disturbances $\eta^{(l)}_t$, $\eta^{(r)}_t$, $\eta^{(s)}_t$ and $I^*_t$ are mutually
uncorrelated in all time periods. In addition, the irregulars
$I^*_{mt}$, $I^*_{j,t-h}$ with $m \neq j$, $h = \ldots, -2, -1, 0, 1, 2, \ldots$, are assumed
to be correlated when $h = 0$, but uncorrelated for $h \neq 0$, and $I^*_t$
has covariance matrix $\Sigma_I$. The same happens with the
system disturbances $\eta^{(a)}_{mt}$, $\eta^{(a)}_{j,t-h}$, $a = l, r, s$, which are also
correlated when $h = 0$, but uncorrelated for $h \neq 0$, with
covariance matrices $\Sigma_l$, $\Sigma_r$, $\Sigma_s$. At each time $t$, the
correlation structure between the components of the composition
is summarized by $\Sigma_I$ and a block diagonal matrix with the
blocks being $\Sigma_l$, $\Sigma_r$, $\Sigma_s$. Note that the relation between the
series arises via the non-zero off-diagonal elements of the
disturbance covariance matrices. The multivariate model
(10) for $\{\theta^*_t\}$ has the following state space formulation:


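$$\begin{aligned}
\theta^*_t &= H^{(\theta)} \alpha^{(\theta)}_t + I^*_t, \\
\alpha^{(\theta)}_t &= T^{(\theta)} \alpha^{(\theta)}_{t-1} + G^{(\theta)} \eta^{(\theta)}_t,
\end{aligned} \qquad (11)$$

where $H^{(\theta)} = [1\;0\;1\;0\;0\;0\;0\;0\;0\;0\;0\;0\;0] \otimes I_M$ and

$$\alpha^{(\theta)}_t = [L^*_{1t}, \ldots, L^*_{Mt}, R^*_{1t}, \ldots, R^*_{Mt}, S^*_{1t}, \ldots, S^*_{Mt}, \ldots, S^*_{1,t-10}, \ldots, S^*_{M,t-10}]'.$$

The transformed survey error process $\{e^*_t\}$ is assumed
to follow an $M$-dimensional vector autoregressive moving
average process (VARMA), defined by $\Phi(B) e^*_t = \Theta(B) a_t$
with mean vector $E(e^*_t) = 0$ and

$$\Phi(B) = I - \Phi_1 B - \cdots - \Phi_p B^p, \qquad \Theta(B) = I - \Theta_1 B - \cdots - \Theta_q B^q,$$

where $\Phi_1, \ldots, \Phi_p, \Theta_1, \ldots, \Theta_q$ are coefficient matrices and $a_t$
is an $M$-dimensional white noise random vector with zero
mean and covariance structure:

$$E(a_t a'_{t-h}) = \begin{cases} \Sigma_a, & h = 0, \\ 0, & h \neq 0. \end{cases}$$

The cross-covariance matrix function for the VARMA
process $\{e^*_t\}$ (see Wei 1993, page 333) is given by:

$$\Gamma_{e^*}(h) = \mathrm{COV}(e^*_{t-h}, e^*_t) = E(e^*_{t-h}\, e^{*\prime}_t),$$

where $\{\Gamma_{e^*}(h)\}_{mj} = \gamma_{e^*,mj}(h) = \mathrm{COV}(e^*_{m,t-h}, e^*_{j,t})$, and the
cross-correlation function for the vector process is defined
as:

$$P_{e^*}(h) = D_{e^*}^{-1/2}\, \Gamma_{e^*}(h)\, D_{e^*}^{-1/2},$$

where $D_{e^*} = \mathrm{diag}(\gamma_{e^*,11}(0), \ldots, \gamma_{e^*,MM}(0))$.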


The state space representation of VARMA models can 
be found in Reinsel (1993, section 7.2). The separate 
models for the transformed signal and sampling errors can 
be cast into a unique state space model, see Silva (1996, 
Chapter 8) for details. 
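To make the structure of the signal model concrete, the following Python sketch builds a transition matrix and a measurement matrix for $M$ stacked monthly basic structural models, ordering the state as levels, slopes, then current and lagged seasonals. The transition matrix shown is the standard dummy-seasonal form implied by the recursions in (10); it is an assumption of this sketch, since the system matrices are not written out explicitly here. The function name `bsm_system` is hypothetical, and the disturbance loading matrix and the VARMA block for the sampling errors are omitted.

```python
import numpy as np

def bsm_system(M, s=12):
    """Transition (T) and measurement (H) matrices for M stacked BSMs with period s."""
    n = 2 + (s - 1)                      # level, slope and s-1 seasonal states per series
    T1 = np.zeros((n, n))
    T1[0, 0] = T1[0, 1] = 1.0            # L_t = L_{t-1} + R_{t-1}
    T1[1, 1] = 1.0                       # R_t = R_{t-1}
    T1[2, 2:] = -1.0                     # S_t = -(S_{t-1} + ... + S_{t-s+1})
    T1[3:, 2:-1] = np.eye(s - 2)         # lagged seasonals are shifted down one slot
    h1 = np.zeros((1, n))
    h1[0, 0] = h1[0, 2] = 1.0            # signal = level + current seasonal
    I_M = np.eye(M)
    # Kronecker products keep the M series side by side within each component block.
    return np.kron(T1, I_M), np.kron(h1, I_M)

T, H = bsm_system(M=2)
alpha = np.zeros(26)
alpha[:2] = [1.0, 2.0]                   # hypothetical levels
alpha[2:4] = [0.1, 0.2]                  # hypothetical slopes
alpha_next = T @ alpha                   # one noise-free step of the state equation
```

With $M = 2$ and monthly data the state has 26 elements, and `H` reproduces the row vector with ones in the first and third positions, Kronecker-multiplied by the identity.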


4. ESTIMATION FROM THE 
TRANSFORMED DATA 


As in previous sections, we distinguish between the 
estimation of the structure of the survey errors, the noise,
and the estimation of the covariances of the basic structural 
model. Once these are obtained, we employ the Kalman 
filter to get estimates of the trend and seasonals which 
determine the signal. Before carrying out the signal extrac- 
tion, the VARMA model for the survey errors must be 
identified. 

The model specification for the error process $\{e^*_t\}$
depends on the sampling design, particularly on the level of 
sample overlap between occasions, and also on data availa- 
bility. Many authors have considered the problem of 
modelling the sampling error process in a univariate frame- 
work, see, for example, Scott and Smith (1974), 
Pfeffermann (1989, 1991), Bell and Hillmer (1990), Binder 
and Dick (1989), Tiller (1989, 1992), Pfeffermann and 
Bleuer (1993), Binder, Bleuer and Dick (1993), 
Pfeffermann, Bell and Signorelli (1996) and Pfeffermann, 
Feder and Signorelli (1998). However, in all of these cases 
the authors are working with the original data instead of the 
transformed data. After transformation, it is difficult to 
carry out a full primary analysis based on individual 
observations, see Silva (1996, Chapter 7). 

Many repeated surveys are based on a rotating panel 
design in which K panels of sampling units are investigated 
at each survey round (time point) and panels are replaced in 
a systematic manner, according to the rotating pattern of the 
survey design. In these surveys, elementary design-unbiased
estimates $y_t^{(k)}$, $k = 1, \ldots, K$, for the population parameter $\theta_t$
can be obtained from each rotation group. A rotation group
is a set of sampling units that joins and leaves the sample at
the same time.

In a two-stage survey, in which the primary sampling
units (enumeration areas) remain in the sample for all
survey occasions, the replacement of panels of households
(second-stage units) is ordinarily carried out within
geographical regions defined by mutually exclusive groups of
enumeration areas. Note that a survey with K panels
produces K streams of estimates, where a stream is a time
series of all sample estimates based on samples from the
same enumeration area, that is, a time series of
elementary estimates.

Pfeffermann, Bell and Signorelli (1996) and 
Pfeffermann, Feder and Signorelli (1998) show how to esti- 
mate the autocorrelation of the sampling error process for 
univariate data, before transformation, using the so-called 
pseudo-errors, defined as: 

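$$\tilde{e}_t^{(k)} = y_t^{(k)} - \bar{y}_t, \qquad (12)$$

where $\bar{y}_t = \frac{1}{K} \sum_{k=1}^{K} y_t^{(k)}$. If there is no rotation bias, it
follows that:

$$\tilde{e}_t^{(k)} = e_t^{(k)} - \bar{e}_t, \qquad (13)$$

thus contrasts in $y_t^{(k)}$ are contrasts in the panel sampling
errors $e_t^{(k)}$.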

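For the compositional case we apply, for each elementary
estimate, the transformation $v_t^{(k)} = a_M(y_t^{(k)}) =
(v_{1t}^{(k)}, \ldots, v_{Mt}^{(k)})'$, which has as its $m$th component
($m = 1, \ldots, M$):

$$v_{mt}^{(k)} = \log\left(\frac{y_{mt}^{(k)}}{y_{M+1,t}^{(k)}}\right) = \log\left(\frac{\theta_{mt}}{\theta_{M+1,t}}\right) + \log\left(\frac{u_{mt}^{(k)}}{u_{M+1,t}^{(k)}}\right). \qquad (14)$$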


From (14), a vector model for the $k$th series of
transformed elementary estimates can be written as:

$$v_t^{(k)} = \theta^*_t + e_t^{*(k)}, \qquad (15)$$

with $e_t^{*(k)} = (e_{1t}^{*(k)}, \ldots, e_{Mt}^{*(k)})'$ and $e_{mt}^{*(k)} = \log(u_{mt}^{(k)}/u_{M+1,t}^{(k)})$
for $m = 1, \ldots, M$. Hence, from (15), M-dimensional time
series of transformed pseudo-errors can be constructed from
deviations of the transformed rotation group estimates about
their overall mean. The transformed pseudo-errors for the $k$th
rotation group are defined as:

$$\tilde{e}_t^{*(k)} = (\tilde{e}_{1t}^{*(k)}, \ldots, \tilde{e}_{Mt}^{*(k)})' = v_t^{(k)} - \bar{v}_t, \qquad (16)$$

where $\bar{v}_t = \frac{1}{K} \sum_{k=1}^{K} v_t^{(k)}$. Note, in addition, that
$\tilde{e}_t^{*(k)} = e_t^{*(k)} - \bar{e}^*_t$.

From (14) and (15), it becomes clear that the framework
introduced by Pfeffermann, Bell and Signorelli (1996) can
also be applied to the transformed model.
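The construction of the transformed pseudo-errors can be sketched in a few lines of Python. The array layout (time by rotation group by component), the function name and the synthetic Dirichlet data below are all assumptions made for illustration; they do not reproduce the survey data.

```python
import numpy as np

def transformed_pseudo_errors(y_panels):
    """y_panels: array of shape (T, K, M+1) holding the K rotation-group
    compositions for each month. Returns an array of shape (T, K, M)."""
    y_panels = np.asarray(y_panels, dtype=float)
    # Additive logratio of every elementary estimate, last part as reference.
    v = np.log(y_panels[..., :-1] / y_panels[..., -1:])
    # Mean of the transformed elementary estimates over the K rotation groups.
    v_bar = v.mean(axis=1, keepdims=True)
    # Deviations about the mean: the transformed pseudo-errors.
    return v - v_bar

rng = np.random.default_rng(0)
# Synthetic panel compositions: 57 months, 4 rotation groups, 3 categories.
raw = rng.dirichlet([6.0, 54.0, 40.0], size=(57, 4))
pseudo = transformed_pseudo_errors(raw)
```

By construction the pseudo-errors sum to zero across rotation groups in every month, so they carry information about contrasts between panel sampling errors only, not about the signal.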

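The cross-correlation matrices of the transformed
sampling errors can be obtained by averaging the
cross-covariance matrices of the transformed pseudo-errors as
follows (for details see Silva 1996, Chapter 7):

$$\hat{P}_{e^*}(h) = \frac{1}{K} \sum_{k=1}^{K} \big[D^{(k)}_{\tilde{e}^*}\big]^{-1/2}\, \tilde{\Gamma}^{(k)}_{\tilde{e}^*}(h)\, \big[D^{(k)}_{\tilde{e}^*}\big]^{-1/2}, \qquad (17)$$

where

$$\tilde{\Gamma}^{(k)}_{\tilde{e}^*}(h) = \mathrm{COV}(\tilde{e}^{*(k)}_{t-h}, \tilde{e}^{*(k)}_t) = E(\tilde{e}^{*(k)}_{t-h}\, \tilde{e}^{*(k)\prime}_t),$$

with

$$\{\tilde{\Gamma}^{(k)}_{\tilde{e}^*}(h)\}_{mj} = \tilde{\gamma}^{(k)}_{mj}(h) = \mathrm{COV}(\tilde{e}^{*(k)}_{m,t-h}, \tilde{e}^{*(k)}_{j,t}),$$

and

$$D^{(k)}_{\tilde{e}^*} = \mathrm{diag}(\tilde{\gamma}^{(k)}_{11}(0), \ldots, \tilde{\gamma}^{(k)}_{MM}(0)).$$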
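A minimal Python sketch of this averaging step is given below. It takes one plausible reading of the procedure, in which a lag-$h$ cross-correlation matrix is computed separately for each rotation group from its series of transformed pseudo-errors and the $K$ matrices are then averaged; the function names, the divisor $T$ in the sample covariances and the synthetic input are assumptions of the sketch.

```python
import numpy as np

def cross_corr(x, h):
    """Sample lag-h cross-correlation matrix of a (T, M) series.
    Entry (m, j) pairs component m at time t-h with component j at time t."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0)
    T = x.shape[0]
    gamma_h = x[:T - h].T @ x[h:] / T          # lag-h cross-covariance matrix
    d = np.sqrt(np.diag(x.T @ x / T))          # lag-0 standard deviations
    return gamma_h / np.outer(d, d)

def averaged_cross_corr(pseudo, h):
    """pseudo: (T, K, M) transformed pseudo-errors; average over the K groups."""
    K = pseudo.shape[1]
    return sum(cross_corr(pseudo[:, k, :], h) for k in range(K)) / K

rng = np.random.default_rng(1)
pseudo = rng.normal(size=(200, 4, 2))          # synthetic stand-in for pseudo-errors
P0 = averaged_cross_corr(pseudo, 0)
P1 = averaged_cross_corr(pseudo, 1)
```

The matrices for $h = 1, 2, \ldots$ obtained in this way are the inputs to the identification of the VARMA order.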


Once the correlation matrices $P_{e^*}(h)$, $h = 1, 2, \ldots$ have been
estimated, a VARMA model to represent the transformed
survey error process can be selected and estimates of the
respective parameter matrices can be computed, provided
the series of transformed pseudo-errors are available. Then,
as described in section 3, a state space model for
representing the transformed signal and sampling errors can
be defined and the Kalman filter equations can be used to
get filtered and smoothed estimates for the unobservable
components. The application of the Kalman Filter requires
the estimation of the unknown hyperparameters (the
covariance matrices $\Sigma_l$, $\Sigma_r$, $\Sigma_s$, $\Sigma_I$) and the
estimation of the initial state vector and the respective
covariance matrices.

Having addressed the issue of how to model the survey 
estimates in a compositional framework and how to identify 
the time series model for the transformed sampling errors, 
the following section presents the results of an empirical 
study using compositional data from the Brazilian Labour 
Force Survey. 


5. MODELLING COMPOSITIONAL TIME 
SERIES IN THE BRAZILIAN LABOUR 
FORCE SURVEY 


The Brazilian Labour Force Survey (BLFS) collects 
monthly information about employment, hours of work, 
education and wages together with some demographic 
information. It classifies the survey respondents, aged 15 
and over, according to their employment status in the week 
prior to the interview into three main groups: employed, 
unemployed and not in the labour force, following the 
International Labour Organization (ILO) definitions. The 
survey targets the population living in the six major
metropolitan areas in the country. The BLFS is a two-stage
sample survey in which the primary sampling units (psu)
are the census enumeration areas (EA) and the second-stage
units (ssu) are the households. The primary sampling units
are selected with probabilities proportional to their sizes
and then a fixed number of households is selected from 
each sampled EA by systematic sampling. All household 
members within the selected households are enumerated. 
The primary sampling units remain the same for a period of 
roughly 10 years (as in a master sample). New primary 
sampling units are selected when information from a new 
population census becomes available. 

In addition, the BLFS is a rotating panel survey. For any 
given month the sample is composed of four rotation groups 
of mutually exclusive sets of primary sampling units. The 
rotation pattern applies to panels of second-stage units 
(households). Within each rotation group a panel of house- 
holds stays in the sample for four successive months, is 
rotated out for the following 8 months and then is sampled 
again for another spell of four successive months. Each 
month one panel is rotated out of the sample. The substituting
panel can be a new panel or one that has already been
observed for its first four-month period. Note that the
4-8-4 rotation pattern induces a complex correlation structure
for the sampling errors over time and that there is a 75% 
overlap between two successive months. 
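The 75% figure follows from simple counting: with four panels in the sample and one replaced each month, three of the four panels are common to two successive months. The Python sketch below generalises this to other short lags. It is a simplification that counts only panels observed in the same four-month spell in both months and ignores panels returning after their eight-month rest, so it says nothing about the overlap at lags of nine months or more; the function name is hypothetical.

```python
def within_spell_overlap(lag, months_in_sample=4):
    """Fraction of panels common to two months `lag` months apart,
    counting only panels that are in the same spell in both months."""
    return max(0, months_in_sample - lag) / months_in_sample

# Overlap for lags 0 to 4 months under the simplification described above.
overlaps = {lag: within_spell_overlap(lag) for lag in range(0, 5)}
```

The declining overlap at lags one to three, followed by no within-spell overlap from lag four onwards, is one reason why low-order vector ARMA models are natural candidates for the sampling errors.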

The empirical work was carried out using data from the 
Sao Paulo metropolitan area covering the period from 
January 1989 to September 1993 (57 observations). The 
quantities of interest are the proportions of employed, 
unemployed and not in the labour force, and also the unem- 
ployment rate. Using the monthly individual observations, 
the series of sample estimates and their respective estimated 
standard errors were computed using data of each specific 
survey round and standard estimators. For each month, two
sets of estimates were obtained: the direct sample
estimates, derived from the complete data collected in a
given month, and four elementary estimates, each based on
data from a single rotation group. The panel estimates are
used to estimate the sampling error autocorrelations and to
help identify the time series model for the sampling
errors.

In this study the observed composition has $M + 1 = 3$
components and the time series is defined as the sequence
of vectors $y_t = (y_{1t}, y_{2t}, y_{3t})'$, where:


$y_{1t}$ is the estimated proportion of unemployed persons
in month $t$;


$y_{2t}$ is the estimated proportion of employed persons in
month $t$;


$y_{3t}$ is the estimated proportion of persons not in the
labour force in month $t$.


The model for the BLFS must incorporate the special
features of the data. Firstly, it is a compositional time series
belonging to the Simplex $S^2$ at each time $t$. Secondly, the
time series are subject to sampling errors. Following the
approach in section 2, we first map the composition onto
$R^2$ using the additive logratio transformation with $y_{3t}$ as
the reference category. As $y_t$ is a vector of sample
estimates, it can be modelled as in equation 1 and the vector
model for the transformed series is given by equation 9.
Then, the transformed composition is modelled using a
multivariate state space model that accounts for the
autocorrelations between the sampling errors. Finally, the
model-based estimates are transformed back to the original space.
Figure 1 displays the series of transformed compositions.

Figure 1. Brazilian Labour Force Series - SAO PAULO Transformed Compositions.
Horizontal axis: date. Series: LOG (Employed/Inactive) and LOG (Unemployed/Inactive).
Vertical lines = September 89 - September 93.

The model for the transformed sample estimates $v_t$ is
composed of a bivariate model for the transformed signal
$\theta^*_t$, describing how the transformed population quantities
evolve in time, and a bivariate model representing the time
series relationship between transformed sampling errors $e^*_t$.
The transformed signal process $\{\theta^*_t\}$ is assumed to follow
the bivariate basic structural model (equation 11) as
described in section 3. As mentioned before, a VARMA
model was used to represent the sampling error series. The
correlation structure of the transformed sampling errors was
estimated using the transformed pseudo-errors as in
equation 16. In addition, estimates of the partial lag
correlation matrices for $\{e^*_t\}$ were computed using a
recursive algorithm provided in Wei (1993, pages 359-362).
A program in SAS-IML was developed which gives the
corresponding schematic representations (Tiao and Box 1981)
and a statistical test to help establish the order of the vector
process. The form of the correlation
matrices and the results for the statistical test, available in
Silva (1996), indicate that a VAR(1), a VAR(2) or a
VARMA(1,1) model could be used to represent the
transformed sampling error process. In the event, the
VARMA(1,1) was chosen because it yields smaller
standard errors for estimates of the unemployment rate. The
parameter estimates for this model were obtained from the
relationship between the cross-covariance function and the
parameter matrices given in Wei (1993, pages 346-347).
The VARMA(1,1) fitted for $\{e^*_t\}$ is given by:

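$$\begin{bmatrix} e^*_{1t} \\ e^*_{2t} \end{bmatrix} =
\begin{bmatrix} 0.7347 & 0.2414 \\ \text{[illegible]} & \text{[illegible]} \end{bmatrix}
\begin{bmatrix} e^*_{1,t-1} \\ e^*_{2,t-1} \end{bmatrix}
- \begin{bmatrix} 0.3162 & 0.2590 \\ -0.7666 & -0.2749 \end{bmatrix}
\begin{bmatrix} a_{1,t-1} \\ a_{2,t-1} \end{bmatrix}
+ \begin{bmatrix} a_{1t} \\ a_{2t} \end{bmatrix}$$

with

$$\hat{\Sigma}_a = \begin{bmatrix} 0.0001723 & 0.0003476 \\ 0.0003476 & 0.0051660 \end{bmatrix}. \qquad (18)$$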

Having put the combined model for the transformed survey 
estimates into the state space form, the Kalman Filter 
equations can be used to get filtered and smoothed esti- 
mates for the unobservable components. Note that the 
estimation of the model for the transformed sampling errors 
(equation 18) was implemented outside the Kalman Filter. 
The application of the Kalman Filter requires the estimation 
of the unknown hyperparameters (the covariances), the 
initial state vector and respective covariance matrix. 
Assuming that the disturbances $\eta^{(\theta)}_t$, $a_t$ and $I^*_t$ are normally
distributed, the log-likelihood function of the (transformed)
observations can be expressed via the prediction error
decomposition (for details see Harvey 1989). Estimates for
the model covariances were obtained by maximum like- 
lihood, applying a quasi-Newton optimization technique. A 
computer program to implement the maximization proce- 
dure was developed using the optimization routine NLPQN 
from SAS-IML. 

The initialization of the Kalman filter was carried out
using a combination of diffuse and proper priors.
Following this approach, the non-stationary components
$(\alpha^{(\theta)}_t)'$ of the state vector were initialized with very large
error variances and the respective components of the initial
state vector were taken as zero. The stationary components
$(e^{*\prime}_t, a'_t)'$ were initialized by the corresponding
unconditional mean and variance.
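The prediction error decomposition can be sketched as follows for a generic linear Gaussian state space model. This is a textbook Kalman filter written in Python with NumPy, not the SAS-IML program used in the study; the function name, argument names and the tiny example are assumptions. In this sketch a diffuse prior would be approximated by placing large values on the relevant diagonal entries of the initial covariance matrix.

```python
import numpy as np

def kalman_loglik(y, H, T, Q, R, a_pred, P_pred):
    """Gaussian log-likelihood via the prediction error decomposition for
        y_t = H a_t + eps_t,        eps_t ~ N(0, R)
        a_t = T a_{t-1} + eta_t,    eta_t ~ N(0, Q)
    a_pred, P_pred: mean and covariance of the state predicted for the
    first observation. y has shape (n_obs, p)."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a_pred, dtype=float)
    P = np.asarray(P_pred, dtype=float)
    p = y.shape[1]
    loglik = 0.0
    for y_t in y:
        v = y_t - H @ a                          # one-step prediction error
        F = H @ P @ H.T + R                      # its covariance matrix
        _, logdet = np.linalg.slogdet(F)
        loglik += -0.5 * (p * np.log(2.0 * np.pi) + logdet
                          + v @ np.linalg.solve(F, v))
        K = np.linalg.solve(F, H @ P).T          # gain P H' F^{-1}
        a = a + K @ v                            # measurement update
        P = P - K @ H @ P
        a = T @ a                                # time update
        P = T @ P @ T.T + Q
    return float(loglik)

# Degenerate check: a state fixed at zero leaves y_t ~ N(0, 1) independently.
y = np.array([[0.5], [-1.0], [2.0]])
one, zero = np.eye(1), np.zeros((1, 1))
ll = kalman_loglik(y, one, zero, zero, one, np.zeros(1), zero)
```

Maximising such a function over the unknown covariance parameters with a quasi-Newton routine mirrors the estimation strategy described above.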

When fitting the model, the estimated covariance 
matrices obtained for the slope and seasonal components 
were very small and could be set to zero. This implies that 
the seasonals are assumed to be deterministic and that the 
slope is assumed to be fixed, giving rise to a local level 
model with a drift and non-stochastic seasonals for the 
signal. Indeed, as pointed out by Koopman, Harvey, 
Doornik and Shephard (1995, page 39), when the number 
of years considered in the analysis is small, it seems reason- 
able to fix the seasonals since there is not enough data to 
allow the estimation of a changing pattern. The fact that a 
fixed seasonal pattern is validated by the estimation process 
is a satisfactory feature of the modelling procedure. In 
addition, the estimated covariance matrix of the irregular 
component was also found to be very small (and hence 
undetectable) in comparison to the sampling error and so, 
as expected, in the presence of relatively large sampling 
errors, there was no need to include irregular components 
in the model for the transformed signal. The parameter
estimates and respective asymptotic standard errors (displayed in
parentheses) are presented in Table 1.


Table 1 
Estimates for the Hyperparameters and Standard Errors 
Model Y elon (2) v,=k,-2, 
BSM + 2.78 0.12 oe 
VARMA (1,1 
(1.1) (0.91) avo 
EOS 387.0 


(Sa) CHAO) 
(1) 
(1) Local level model with drift and fixed seasonals for the 
signal. 
(2) Upper-triangular contains correlation. 


To evaluate the model performance, empirical distri- 
butions of the standardized residuals were compared with 
a standard normal distribution to verify the assumption that 
the innovations (Peay ra) are normal deviates. Exami- 
nation of corresponding normal plots revealed no departure 
from normality. In addition, we also computed the
autocorrelations of the innovations, which were close to zero,
further validating the model.

Predictions for $y_{t+1}$ and estimates for $\theta_t$ are computed
by applying the additive logistic transformation (equation
5) to predictions $\hat{v}_{t+1|t}$ and smoothed estimates $\hat{\theta}^*_{t|T}$ for the
transformed series and signal, respectively. This
transformation maps these estimates onto $S^2$, guaranteeing that
they satisfy the boundedness constraints.

Unfortunately, although $\hat{L}^*_{t|T}$ and $\hat{S}^*_{t|T}$ can be obtained
from $\hat{\theta}^*_{t|T}$, it is not straightforward to obtain estimates for
the structural unobservable components of the original
signal $\theta_t$, such as $\hat{L}_{t|T}$ and $\hat{S}_{t|T}$. However, if a
multiplicative model with no irregular component is assumed for
$\{\theta_t\}$, such that:


$$\theta_{1t} = L_{1t} S_{1t}, \quad \theta_{2t} = L_{2t} S_{2t}, \quad \theta_{3t} = L_{3t} S_{3t}, \qquad (19)$$

where $L_{mt}$ and $S_{mt}$ for $m = 1, 2, 3$ represent the trend
and seasonal components of the unobservable signals, then
applying an additive logratio transformation to $\theta_t$ results in:

$$\theta^*_{mt} = \log\left(\frac{\theta_{mt}}{\theta_{3t}}\right) = \log\left(\frac{L_{mt} S_{mt}}{L_{3t} S_{3t}}\right) = \log\left(\frac{L_{mt}}{L_{3t}}\right) + \log\left(\frac{S_{mt}}{S_{3t}}\right), \quad m = 1, 2. \qquad (20)$$

This can be rewritten as:

$$\theta^*_{mt} = L^*_{mt} + S^*_{mt}, \qquad (21)$$

with $L^*_{mt} = \log(L_{mt}/L_{3t})$ and $S^*_{mt} = \log(S_{mt}/S_{3t})$.
Thus, the use of a basic structural model for $\{\theta^*_t\}$
corresponds to the case in which the underlying model for $\{\theta_t\}$
decomposes the original signal into its trend and seasonal
components in a multiplicative way. For deriving estimates,
either filtered or smoothed, for $L_{mt}$ note that:

$$\exp(L^*_{1t}) = \frac{L_{1t}}{L_{3t}}, \qquad \exp(L^*_{2t}) = \frac{L_{2t}}{L_{3t}}. \qquad (22)$$


To recover $L_{1t}$, $L_{2t}$, $L_{3t}$ in (22), it is necessary to
assume an explicit relationship between these unobservable
components based on model (19). By doing this, a third
equation can be added to the system in (22) and an estimate
of the original series components can be obtained. Note that
the system has three unknowns for just two equations. In
this case, it is quite natural to assume that the level
components sum to one across the series, being also bounded
between zero and one. Hence, trend estimates for the
original series can be obtained by solving:

$$\exp(L^*_{1t}) = \frac{L_{1t}}{L_{3t}}, \qquad \exp(L^*_{2t}) = \frac{L_{2t}}{L_{3t}}, \qquad L_{1t} + L_{2t} + L_{3t} = 1, \qquad (23a)$$

which results in

$$L_{mt} = \frac{\exp(L^*_{mt})}{1 + \sum_{k=1}^{2} \exp(L^*_{kt})}, \quad m = 1, 2; \qquad L_{3t} = \frac{1}{1 + \sum_{k=1}^{2} \exp(L^*_{kt})}. \qquad (23b)$$

As there is no irregular component in model (19) the
seasonally adjusted figures are given by the trend estimates
in (23). Therefore, the smoothed estimates for the trend of
the original series of proportions are obtained by applying
the additive logistic transformation to $\hat{L}^*_{t|T}$. Consequently,
estimates for the seasonal components of the original
proportions can be computed as:

$$\hat{S}_{m,t|T} = \frac{\hat{\theta}_{m,t|T}}{\hat{L}_{m,t|T}}, \quad m = 1, 2, 3.$$
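The recovery of trend and seasonal components on the original scale can be sketched in Python as follows. The function names and the numerical values are hypothetical; the inputs stand in for smoothed estimates of the transformed signal and transformed trend that would come out of the Kalman smoother.

```python
import numpy as np

def additive_logistic(v):
    """Inverse additive logratio transform with the last part as reference."""
    v = np.asarray(v, dtype=float)
    expv = np.exp(v)
    denom = 1.0 + expv.sum(axis=-1, keepdims=True)
    return np.concatenate([expv / denom, 1.0 / denom], axis=-1)

def recover_components(theta_star, level_star):
    """Map transformed signal and trend back to proportions, then take the
    seasonal factor as signal divided by trend (multiplicative model)."""
    theta = additive_logistic(theta_star)    # signal proportions, sum to one
    level = additive_logistic(level_star)    # trend proportions, sum to one
    seasonal = theta / level                 # multiplicative seasonal factors
    return theta, level, seasonal

# Hypothetical trend and signal for one month (unemployed, employed, inactive).
true_level = np.array([0.05, 0.55, 0.40])
true_theta = np.array([0.06, 0.55, 0.39])
level_star = np.log(true_level[:2] / true_level[2])
theta_star = np.log(true_theta[:2] / true_theta[2])
theta, level, seasonal = recover_components(theta_star, level_star)
```

Because both the signal and the trend are obtained through the additive logistic transformation, each is guaranteed to lie in the simplex, whereas the seasonal factors are unconstrained positive ratios.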


For labour force surveys, an important issue is the
estimation of the unemployment rate series (as opposed to
unemployment proportions) and also the production of the
corresponding seasonally adjusted figures. Recall that $\theta_{1t}$
and $\theta_{2t}$ represent the unknown population proportions of
unemployed and employed people, respectively. Using
these proportions, the unknown unemployment rate at time
$t$ is defined as

$$\frac{\theta_{1t}}{\theta_{1t} + \theta_{2t}}. \qquad (24)$$

Based on model (11), trend estimates for the
unemployment rate can be obtained by simply replacing $\theta_{mt}$
by $L_{mt}$, $m = 1, 2$, in equation 24. In conclusion, the
methodology developed in this section provides signal (and trend)
estimates that are bounded between zero and one and satisfy
the unit-sum constraint. It also provides estimates for the
seasonal and trend components of series comprising ratios
of the original proportions, which is a useful feature.
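Since the unemployment rate is a ratio of two parts of the composition, it can be computed directly from the two logratios without first returning to the simplex: the common reference category cancels. The Python sketch below illustrates this with hypothetical proportions; the function name is an assumption.

```python
import numpy as np

def unemployment_rate_from_logratios(theta_star):
    """theta_star[..., 0] = log(unemployed / inactive),
    theta_star[..., 1] = log(employed / inactive)."""
    theta_star = np.asarray(theta_star, dtype=float)
    unemployed = np.exp(theta_star[..., 0])   # proportional to the unemployed share
    employed = np.exp(theta_star[..., 1])     # proportional to the employed share
    return unemployed / (unemployed + employed)

# Hypothetical proportions: unemployed, employed, not in the labour force.
theta = np.array([0.06, 0.54, 0.40])
theta_star = np.log(theta[:2] / theta[2])
rate = unemployment_rate_from_logratios(theta_star)
```

Applying the same function to the transformed trend in place of the transformed signal gives the trend of the unemployment rate.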

Figure 2 presents the design-based estimates and the 
model-dependent estimates for the proportion of unem- 
ployed persons, for the time period January 1989 to 
September 1993. The model-dependent estimates are the 
smoothed estimates which use all the data for the whole 
sample period. As can be seen from the graph, the signal 
estimates behave similarly to the design-based estimates 
although some of the sharp turning points in the series have 
been smoothed out. 

Model-dependent trend estimates were obtained by
fitting the basic structural model defined for the signal
process when sampling error variation was modelled as a
VARMA(1,1). These estimates were compared with the
estimates produced by the familiar X-11 procedure. Figure
3 displays the trend produced for the unemployment rate
series by both methods together with the estimates obtained
by fitting a standard basic structural model which does not
account for sampling error variation.

The trend produced by our model is smoother,
suggesting that the model-dependent procedure succeeds in
removing the fluctuations induced by the sampling errors.

In addition, model-dependent estimates for the seasonal
effects of the original compositions were also obtained from
the multivariate modelling procedure which accounts for
two very important features of the data, namely the
compositional constraints and the presence of sampling errors.

Figure 2. Brazilian Labour Force Series - SAO PAULO Design Based and Model Dependent Estimates Proportion of

Figure 3. Brazilian Labour Force Series - SAO PAULO Trend Estimates for the Unemployed Rates Series


6. CONCLUSIONS


This paper proposes a state space approach for modelling 
compositional time series from repeated surveys. The im- 
portant feature of the proposed methodology is that it pro- 
vides bounded predictions and signal estimates of the para- 
meters in a composition, satisfying the unity-sum constraint, 
while taking into account the sampling errors. This is ac- 
complished by mapping the compositions from the Simplex 
onto Real space using the additive logratio transformation, 
modelling the transformed data employing multivariate 
state space models, and then applying the additive logistic 
transformation to obtain estimates in the original scale. 

The empirical work using data from the Brazilian Labour 
Force Survey demonstrates the usefulness of this modelling 
procedure in a genuine survey situation, showing that it is 
possible to model the multivariate system and obtain esti- 
mates for all the relevant components. The results of the 
empirical work also show that smoother trends and fixed 
seasonals are obtained from a model which explicitly 
accounts for the sampling errors, when compared with 
estimates produced by X-11. In addition, because the 
model-dependent estimators combine past and current 
survey data, the standard deviations of these estimates are 
in general lower than the standard deviations of the
design-based estimators, as shown in Silva (1996, Chapter 8).

One drawback of the proposed procedure is that 
although confidence regions for the original compositional 
vector can be constructed based on the model-dependent 
estimates by using the additive logistic normal distribution, 
confidence intervals for the individual proportions are not 
readily available. Such intervals could be obtained from 
marginal distributions of the additive logistic normal 
distribution, but these can only be evaluated by integrating 
out some of the elements of the compositional vector and, 
as pointed out by Brunsdon (1987, page 135), this produces 
intractable expressions. 

Under a state space formulation a wide variety of models 
is available to represent the multivariate signal and noise 
processes, which is a great benefit of this modelling 
procedure. The application of the method to different data 
sets is recommended. Further empirical research should 
also consider situations where the composition lies on a 
Simplex with dimensions higher than two and/or with 
compositions evolving close to the boundaries of the 
interval [0, 1]. In addition, a better insight into the
performance of the modelling procedure may be gained by
applying the method to simulated data, for which the “true”
underlying models are known. The models considered here
can also be extended to incorporate rotation group bias
effects and explanatory variables.